[GE users] Integrating SGE with condor and BLCR

Reuti reuti at staff.uni-marburg.de
Wed Nov 28 12:36:37 GMT 2007


On 28.11.2007 at 12:29, Neeraj Chourasia wrote:

> Thanks Reuti,
>    It's working now, but I am not sure whether it can be used to
> checkpoint an OpenMPI application. Since OpenMPI doesn't have its own
> checkpointing implemented, can BLCR/Condor be extended to support
> checkpointing?

OpenMPI 1.3 will have built-in checkpointing AFAIK, so I would wait
for this release. Checkpointing parallel applications is far more
complicated than checkpointing serial ones.

The only option for now would be to build application-level checkpointing
into your application, i.e. the rank 0 process has to write the
computed data and the state of the program to a checkpoint file from
time to time, and the application resumes from this file on restart
(as outlined for a serial application in my Howto).
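
A minimal sketch of that idea, assuming a JSON checkpoint file and a toy
computation. The names save_checkpoint, load_checkpoint and run are
hypothetical and not taken from the Howto, and the rank argument stands in
for whatever the MPI library reports (for example
MPI.COMM_WORLD.Get_rank() in mpi4py). Only rank 0 writes, and the file is
replaced atomically so that a job killed mid-write leaves the previous
checkpoint intact:

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic when tmp and path share a filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def run(path, n_steps, every=10, rank=0, stop_after=None):
    """Toy computation that resumes from the last checkpoint.

    stop_after is an assumption used only to simulate a job being
    aborted partway through.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if rank == 0 and state["step"] % every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated abort, nothing written here
    if rank == 0:
        save_checkpoint(path, state)
    return state
```

Calling run() again with the same path after an abort continues from the
last written step instead of from step 0. In a real parallel application
the other ranks would also have to send their part of the state to rank 0
before it writes, which this sketch leaves out.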

--Reuti


>  I tried compiling a simple MPI application with condor_compile, but
> it failed. Similarly, BLCR says it is for node-level serial job
> checkpointing and hasn't been tested on MPI-like parallel applications.
>
> -Neeraj
> Reuti wrote:
>> Hi,
>>
>> On 28.11.2007 at 09:55, Neeraj Chourasia wrote:
>>
>>> Hello Guys,
>>>
>>>    I tried integrating SGE with third-party checkpointing libraries,
>>> namely Condor and BLCR, but I am unable to checkpoint the application.
>>> While searching the mailing list I found the issue below:
>>>
>>>        http://gridengine.sunsource.net/issues/show_bug.cgi?id=2037
>>>
>>>   I am able to checkpoint the Condor job if I manually send the
>>> application the USR2 signal, but when I suspend the queue/job, SGE
>>> does not checkpoint the application.
>>> The configuration of the Condor checkpoint is as follows:
>>
>> you followed:
>>
>> http://gridengine.sunsource.net/howto/checkpointing.html
>>
>> and
>>
>> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
>>
>> What I see is that you didn't include "m" in the "when" option.
>> You will need this, or, for the BLCR checkpointing, a call to the
>> checkpointing script in the migrate script. The state diagrams in
>> Lip Kian's Howto show the actual behavior of SGE. Checkpoints are
>> only created at "min_cpu_interval" time steps.
>>
>> -- Reuti
>>
>>
>>> >  qconf -sckpt check_transparent
>>> ckpt_name          check_transparent
>>> interface          TRANSPARENT
>>> ckpt_command       NONE
>>> migr_command       NONE
>>> restart_command    NONE
>>> clean_command      NONE
>>> ckpt_dir           /home/neeraj/checkpoint
>>> signal             USR2
>>> when               xs
>>>
>>>
>>> Similarly for BLCR
>>>
>>> >qconf -sckpt BLCR
>>> ckpt_name          BLCR
>>> interface          APPLICATION-LEVEL
>>> ckpt_command       /home/neeraj/local/sge/ckpt/blcr/blcr_checkpoint.sh $job_id \
>>>                   $job_pid $ckpt_dir
>>> migr_command       /home/neeraj/local/sge/ckpt/blcr/blcr_migrate.sh $job_id \
>>>                   $job_pid $ckpt_dir
>>> restart_command    NONE
>>> clean_command      /home/neeraj/local/sge/ckpt/blcr/blcr_clean.sh $job_id \
>>>                   $job_pid $ckpt_dir
>>> ckpt_dir           /home/neeraj/checkpoint
>>> signal             NONE
>>> when               xsr
>>>
>>>
>>> Please help me...
>>>
>>> -Neeraj
>>>
>>> The information contained in this electronic message and any  
>>> attachments to this message are intended for the exclusive use of  
>>> the addressee(s) and may contain proprietary, confidential or  
>>> privileged information. If you are not the intended recipient,  
>>> you should not disseminate, distribute or copy this e-mail.  
>>> Please notify the sender immediately and destroy all copies of  
>>> this message and any attachments contained in it.
>>>
>>> Contact your Administrator for further information.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net