[GE users] Integrating SGE with condor and BLCR

Neeraj Chourasia neeraj at crlindia.com
Wed Nov 28 11:29:32 GMT 2007


Thanks Reuti,
    It's working now, but I am not sure whether it can be used to checkpoint 
Open MPI applications. Since Open MPI doesn't have its own checkpointing 
implemented, can BLCR/Condor be extended to support checkpointing?

  I tried compiling a simple MPI application with condor_compile, but it 
failed. Similarly, BLCR says it is for node-level serial job checkpointing 
and hasn't been tested on MPI-like parallel applications.

-Neeraj
Reuti wrote:
> Hi,
>
> On 28.11.2007 at 09:55, Neeraj Chourasia wrote:
>
>> Hello Guys,
>>
>>    I tried integrating SGE with third-party checkpointing libraries, 
>> namely Condor and BLCR, but I am unable to checkpoint the application. 
>> On searching the mailing list I found the issue below:
>>
>>        http://gridengine.sunsource.net/issues/show_bug.cgi?id=2037
>>
>>   I am able to checkpoint with Condor if I manually send the 
>> application the USR2 signal, but when I suspend the queue/job, SGE 
>> does not checkpoint the application.
>> The configuration of the Condor checkpoint is as follows:
>
> you followed:
>
> http://gridengine.sunsource.net/howto/checkpointing.html
>
> and
>
> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
>
> What I see is that you didn't include "m" in the "when" option. You 
> will need this, or else include a call to the checkpointing script in 
> the BLCR migrate script. The state diagrams in Lip Kian's Howto show 
> the actual behavior of SGE. Checkpoints are only created at 
> "min_cpu_interval" time steps.
>
> -- Reuti
>
>
>> >  qconf -sckpt check_transparent
>> ckpt_name          check_transparent
>> interface          TRANSPARENT
>> ckpt_command       NONE
>> migr_command       NONE
>> restart_command    NONE
>> clean_command      NONE
>> ckpt_dir           /home/neeraj/checkpoint
>> signal             USR2
>> when               xs
>>
>>
>> Similarly for BLCR
>>
>> >qconf -sckpt BLCR
>> ckpt_name          BLCR
>> interface          APPLICATION-LEVEL
>> ckpt_command       /home/neeraj/local/sge/ckpt/blcr/blcr_checkpoint.sh $job_id \
>>                   $job_pid $ckpt_dir
>> migr_command       /home/neeraj/local/sge/ckpt/blcr/blcr_migrate.sh $job_id \
>>                   $job_pid $ckpt_dir
>> restart_command    NONE
>> clean_command      /home/neeraj/local/sge/ckpt/blcr/blcr_clean.sh $job_id \
>>                   $job_pid $ckpt_dir
>> ckpt_dir           /home/neeraj/checkpoint
>> signal             NONE
>> when               xsr
>>
>>
>> Please help me...
>>
>> -Neeraj
>>
>> The information contained in this electronic message and any 
>> attachments to this message are intended for the exclusive use of the 
>> addressee(s) and may contain proprietary, confidential or privileged 
>> information. If you are not the intended recipient, you should not 
>> disseminate, distribute or copy this e-mail. Please notify the sender 
>> immediately and destroy all copies of this message and any 
>> attachments contained in it.
>>
>> Contact your Administrator for further information.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
>
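
Reuti's two suggestions in the quoted reply (add "m" to the "when" option, or have the migrate script call the checkpoint script) could be sketched roughly as below. This is a minimal, untested sketch and not a confirmed recipe: the function names add_periodic_flag and migrate_with_checkpoint and the CKPT_SCRIPT variable are hypothetical, and the second function only assumes that blcr_checkpoint.sh takes the same three arguments shown in the ckpt_command line.

```shell
#!/bin/sh
# Sketch only: the helper names and CKPT_SCRIPT are hypothetical.

# Option 1: read a `qconf -sckpt`-style dump on stdin and append "m" to the
# value of the `when` line if it is not already present. All other lines
# pass through unchanged.
add_periodic_flag() {
    awk '
        $1 == "when" && $2 !~ /m/ { printf "%-18s %s\n", $1, $2 "m"; next }
        { print }
    '
}

# Option 2: body of a migrate script that first runs the checkpoint script
# with the same three arguments (job id, job pid, checkpoint directory).
# CKPT_SCRIPT defaults to the path from the BLCR configuration in the thread.
migrate_with_checkpoint() {
    "${CKPT_SCRIPT:-/home/neeraj/local/sge/ckpt/blcr/blcr_checkpoint.sh}" "$1" "$2" "$3"
}
```

On a live cluster the first helper might be used as `qconf -sckpt check_transparent | add_periodic_flag > ckpt.conf` followed by `qconf -Mckpt ckpt.conf`, assuming the dump format is accepted as input unchanged; editing the environment interactively with `qconf -mckpt check_transparent` should achieve the same result.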

