[GE users] Checkpointing on Mac Cluster ?

Reuti reuti at staff.uni-marburg.de
Wed Oct 11 22:56:25 BST 2006


On 11.10.2006, at 23:36, Barry McInnes wrote:

> It got me stumped - it only works with a "`" character?
> A " or ' does nothing to the checkpoint_1 file output; using
> trap `date ...` usr2
> works even though it shouldn't.

Before we go deeper into this: can you please type my interactive
examples from my last post (listed below) and post the results here?
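
For reference, the difference between the two quoting styles can also be
reproduced without an interactive shell. This is only a minimal sketch,
assuming a POSIX-style sh (bash should behave the same way); the scratch
file and the count_args helper are made up for the illustration and are
not part of the Howto scripts:

```shell
#!/bin/sh
# Scratch file standing in for the checkpoint file (illustrative only).
log=$(mktemp)

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

# Backticks: date runs right now, while the line is being expanded.
# Its output went into the file, so the substitution itself is empty
# and only the word USR2 would be left over for trap.
backtick_argc=$(count_args `date >> "$log"` USR2)
backtick_lines=$(wc -l < "$log" | tr -d ' ')

# Single quotes: the string is stored as the handler and only runs
# when the signal arrives.
: > "$log"
trap 'date >> "$log"' USR2
quoted_lines_before=$(wc -l < "$log" | tr -d ' ')
kill -s USR2 $$
kill -s USR2 $$
quoted_lines_after=$(wc -l < "$log" | tr -d ' ')

echo "backticks: arguments left for trap = $backtick_argc"
echo "backticks: lines written at definition time = $backtick_lines"
echo "single quotes: lines before any signal = $quoted_lines_before"
echo "single quotes: lines after two signals = $quoted_lines_after"

# Clean up the handler and the scratch file.
trap - USR2
rm -f "$log"
```

With a correctly quoted trap the file should stay empty until the first
signal and then gain one line per signal.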

Thx - Reuti


> I checked other Mac OS X scripts with trap and they use the ' character.
> I have tried using 31 instead of usr2 and USR2, with no difference; it
> doesn't work.
>
> Is there anything else I can try or set ?
>
> thanks barry
>
> On 10/11/06 12:08 PM, Reuti wrote:
>> Hi Barry,
>>
>> On 11.10.2006, at 19:19, Barry McInnes wrote:
>>
>>> Hi Reuti,
>>>
>>>
>>> I am on example 2 now and can't see why the following happens.
>>> As a note: to get example 1 operational I needed to replace the
>>> ' characters on the trap line with `
>>> #!/bin/sh
>>> # check_transparent1.sh
>>>
>>> #bjm add ` instead of '
>>> trap `date >> $SGE_CKPT_DIR/checkpoint_1` usr2
>>
>> this shouldn't work. The command would be executed at the time you
>> define the trap command, not later on during execution.
>>
>> On a Mac I get this:
>>
>> reuti at defiant:~> trap `date >> test` usr2
>> reuti at defiant:~> cat test
>> Wed Oct 11 19:54:35 CEST 2006
>> reuti at defiant:~> trap -p
>> reuti at defiant:~>
>>
>> No usr2 was ever sent, and the file "test" is created during the
>> definition of the trap. In fact, as the result of the `` is just an
>> empty string, any trap set before would be removed by this command. An
>> interactive test:
>>
>> reuti at defiant:~> trap 'date' usr2
>> reuti at defiant:~> trap -p
>> trap -- 'date' SIGUSR2
>> reuti at defiant:~> ps
>>   PID  TT  STAT      TIME COMMAND
>> 5588  p2  S      0:00.18 -bash
>> reuti at defiant:~> kill -usr2 5588
>> Wed Oct 11 20:00:36 CEST 2006
>> reuti at defiant:~> kill -usr2 5588
>> Wed Oct 11 20:00:48 CEST 2006
>> reuti at defiant:~>
>>
>> So before going to ex #2, #1 must work in the intended way.
>>
>> What did you observe for #1 in detail? Did you wait at least 5 minutes
>> to get the checkpoint event?
>>
>> -- Reuti
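
A side note on testing example #1: the handler can be exercised by hand,
without waiting for the queue's five-minute interval. A rough sketch,
assuming a POSIX-style sh; the background subshell stands in for the job
script and the scratch file for $SGE_CKPT_DIR/checkpoint_1 (both are
stand-ins, not what Grid Engine itself does):

```shell
#!/bin/sh
# Scratch file standing in for the checkpoint file (illustrative only).
ckpt=$(mktemp)

# Background subshell playing the role of the job script.
(
    trap 'date >> "$ckpt"' USR2
    i=0
    while [ "$i" -lt 3 ]; do
        sleep 1
        i=$((i + 1))
    done
) &
job=$!

sleep 1                  # give the subshell time to install its trap
kill -s USR2 "$job"      # what the checkpoint event would otherwise send
wait "$job"              # the handler runs once the current sleep returns

lines=$(wc -l < "$ckpt" | tr -d ' ')
echo "checkpoint lines written: $lines"
rm -f "$ckpt"
```

If the trap line uses backticks instead, the signal is not caught at all
and the subshell is terminated by it.
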
>>
>>
>>>
>>> echo "Script started."
>>>
>>> for ((i=0; i<100; i++)) ; do
>>>     sleep 1
>>> done
>>>
>>> echo "Script finished."
>>>
>>> exit 0
>>>
>>>
>>> So I have done the same for example 2, on the trap line
>>> qsub -q low -ckpt check_transparent check_transparent2.sh
>>>
>>> The job runs for 5 mins then stops. The checkpoint file contained only
>>> a space character, so I added the extra line setting ACTUAL_VALUE to 0;
>>> now the output file ends up with 0 in it:
>>> [mac27:~/Checkpoint_Howto_Examples] bjm% ls -l /usr/local/sge/checkpoint/checkpoint_2
>>> -rw-r--r-- 1 bjm bin 2 Oct 11 10:14 /usr/local/sge/checkpoint/checkpoint_2
>>> [mac27:~/Checkpoint_Howto_Examples] bjm% cat /usr/local/sge/checkpoint/checkpoint_2
>>> 0
>>> [mac27:~/Checkpoint_Howto_Examples] bjm%
>>>
>>> the log file has at the end
>>> Processing 294.
>>> Processing 295.
>>> Processing 296.
>>>
>>> So it looks like it's trapping, but never writes out 296 to the
>>> checkpoint_2 file?
>>>
>>> #!/bin/sh
>>> # check_transparent2.sh
>>>
>>> #bjm
>>> export ACTUAL_VALUE=0
>>>
>>> trap `echo $ACTUAL_VALUE > $SGE_CKPT_DIR/checkpoint_2` usr2
>>>
>>> #
>>> # Check whether we are restarted and a checkpoint file is already available.
>>> #
>>>
>>> if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r "$SGE_CKPT_DIR/checkpoint_2" ] ; then
>>>     read ACTUAL_VALUE < $SGE_CKPT_DIR/checkpoint_2
>>>     echo "Script restarted with value $ACTUAL_VALUE."
>>> else
>>>     ACTUAL_VALUE=1
>>>     echo "Script started."
>>> fi
>>>
>>> #
>>> # Start of the program.
>>> #
>>>
>>> while [ "$ACTUAL_VALUE" -le 1000 ] ; do
>>>     echo "Processing $ACTUAL_VALUE."
>>>     let ACTUAL_VALUE++
>>>     sleep 1
>>> done
>>>
>>> echo "Script finished."
>>>
>>> exit 0
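
On the 0 in checkpoint_2: that is what one would expect if the echo ran
once, when the trap line was read, while ACTUAL_VALUE was still 0. The
timing of the variable expansion matters just as much as the timing of
the command. A small sketch, assuming a POSIX-style sh, with a scratch
file standing in for $SGE_CKPT_DIR/checkpoint_2:

```shell
#!/bin/sh
# Scratch file standing in for the checkpoint file (illustrative only).
ckpt=$(mktemp)
ACTUAL_VALUE=0

# Double quotes: $ACTUAL_VALUE and $ckpt are expanded now, so the stored
# handler always writes the value the variable had at this point (0).
trap "echo $ACTUAL_VALUE > $ckpt" USR2
ACTUAL_VALUE=296
kill -s USR2 $$
saved_double=$(cat "$ckpt")

# Single quotes: nothing is expanded until the signal arrives, so the
# handler sees the current value of the variable.
trap 'echo "$ACTUAL_VALUE" > "$ckpt"' USR2
kill -s USR2 $$
saved_single=$(cat "$ckpt")

echo "double-quoted handler saved: $saved_double"
echo "single-quoted handler saved: $saved_single"

# Clean up the handler and the scratch file.
trap - USR2
rm -f "$ckpt"
```

Only the single-quoted form saves the value reached by the loop, which is
what the restart branch of the script needs to read back.
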
>>>
>>>
>>> My guess is this is a Mac problem; I tried zsh as well as sh with the
>>> same results.
>>>
>>> On 10/10/06 2:48 PM, Barry McInnes wrote:
>>>> Thanks - that was the missing puzzle piece. First script works now; I
>>>> had not read ahead to where you have the parameter in the example.
>>>> On to the next tests...
>>>>
>>>> On 10/10/06 12:45 PM, Reuti wrote:
>>>>> On 10.10.2006, at 20:29, Barry McInnes wrote:
>>>>>
>>>>>> I am going through the "Checkpointing of Serial Jobs" version 1.1a.
>>>>>> I am trying the transparent interface -
>>>>>> created check_transparent via qmon, then
>>>>>> qconf -mckpt check_transparent
>>>>>> ckpt_name          check_transparent
>>>>>> interface          TRANSPARENT
>>>>>> ckpt_command       NONE
>>>>>> migr_command       NONE
>>>>>> restart_command    NONE
>>>>>> clean_command      NONE
>>>>>> ckpt_dir           /usr/local/sge/checkpoint
>>>>>> signal             usr2
>>>>>> when               xmr
>>>>>> The low queue has time set
>>>>>> qconf -sq low | grep cpu
>>>>>> min_cpu_interval      00:05:00
>>>>>> From qmon, I added check_transparent in the low queue's Checkpointing
>>>>>> window, but SGE_CKPT_DIR is never set: when I run
>>>>>> check_transparent1.sh with env as the first line, there
>>>>>> are no CKPT variables.
>>>>>> Should something else be turned on?
>>>>>>
>>>>> You included -ckpt check_transparent in the qsub command? - Reuti
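
A job script can also check this for itself instead of failing later on
the redirection. A minimal sketch, assuming only that SGE_CKPT_DIR shows
up in the job environment once the job is submitted with -ckpt, as it
turned out in this thread; require_ckpt_dir is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the job environment provides a
# usable checkpoint directory.
require_ckpt_dir() {
    if [ -z "${SGE_CKPT_DIR:-}" ]; then
        echo "SGE_CKPT_DIR is not set; was the job submitted with -ckpt?" >&2
        return 1
    fi
    if [ ! -d "$SGE_CKPT_DIR" ] || [ ! -w "$SGE_CKPT_DIR" ]; then
        echo "SGE_CKPT_DIR=$SGE_CKPT_DIR is not a writable directory" >&2
        return 1
    fi
    return 0
}

# Typical use near the top of a job script:
#   require_ckpt_dir || exit 1
```

Submitted without -ckpt, the script then stops with a clear message
rather than writing to /checkpoint_1.
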
>>>>>
>>>>>> thanks barry
>>>>>>
>>>>>> -----
>>>>>> Barry McInnes
>>>>>> 325 Broadway
>>>>>> Boulder CO 80304
>>>>>> (303)4976231
>>>>>> barry.j.mcinnes at noaa.gov
>>>>>> ---
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>
>>>>
>>>
>>
>>
>




