[GE users] Checkpointing on Mac Cluster ?

Barry McInnes Barry.J.Mcinnes at noaa.gov
Wed Oct 11 22:36:09 BST 2006


It has me stumped: it only works with a "`" character.
Using " or ' writes nothing to the checkpoint_1 file, whereas
trap `date ...` usr2
works even though it shouldn't.
I checked other Mac OS X scripts that use trap, and they use the ' character.
I have also tried 31 and USR2 instead of usr2; it makes no difference, it
still doesn't work.
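
For reference, a minimal sketch of the quoting difference discussed here. It assumes a POSIX-style sh and sends USR2 to the shell itself in place of Grid Engine; the temporary directory and the file names "backtick" and "quoted" are hypothetical, chosen only for this demo.

```shell
#!/bin/sh
# Sketch: compare WHEN the trap action runs for backticks vs. single quotes.
# The files under $tmpdir are hypothetical names used only for this demo.
tmpdir=$(mktemp -d)

# Backticks: the shell runs the command immediately, while it is still
# building the trap line. Its output is empty, so no action is installed
# and USR2 is simply reset to its default disposition.
trap `echo ran >> "$tmpdir/backtick"` USR2
backtick_at_definition=$(wc -l < "$tmpdir/backtick" | tr -d ' ')

# Single quotes: the text is stored as the action and runs only when
# USR2 is actually delivered.
trap 'echo ran >> "$tmpdir/quoted"' USR2
if [ -e "$tmpdir/quoted" ]; then
    quoted_at_definition=1
else
    quoted_at_definition=0
fi

# Stand-in for the checkpoint signal: signal this shell twice.
kill -USR2 $$
kill -USR2 $$

quoted_after_signals=$(wc -l < "$tmpdir/quoted" | tr -d ' ')
backtick_after_signals=$(wc -l < "$tmpdir/backtick" | tr -d ' ')

echo "backtick file: $backtick_at_definition line(s) at definition, $backtick_after_signals after the signals"
echo "quoted file exists at definition: $quoted_at_definition, lines after the signals: $quoted_after_signals"
```

If the backtick file only ever gains its line at definition time, a checkpoint_1 file written that way would show a timestamp from job start rather than from a checkpoint event.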

Is there anything else I can try or set ?
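
One thing that could be tried, as a sketch only: example 2 with a single-quoted trap action, so that $ACTUAL_VALUE is expanded when the signal arrives rather than when the trap is defined. To make it runnable outside Grid Engine, it assumes defaults for SGE_CKPT_DIR and RESTARTED, shortens the loop to 5 steps, and sends USR2 to itself at step 3; the ckpt_file and saved_value names are hypothetical. It also uses $(( )) instead of let so that it does not depend on bash.

```shell
#!/bin/sh
# Sketch of check_transparent2.sh with a single-quoted trap action.
# Assumption: outside Grid Engine, SGE_CKPT_DIR and RESTARTED are unset,
# so stand-in defaults are supplied to make the script runnable alone.
SGE_CKPT_DIR=${SGE_CKPT_DIR:-$(mktemp -d)}
RESTARTED=${RESTARTED:-0}
ckpt_file="$SGE_CKPT_DIR/checkpoint_2"

# Single quotes defer expansion: $ACTUAL_VALUE is read when USR2 arrives.
trap 'echo "$ACTUAL_VALUE" > "$ckpt_file"' USR2

# Resume from the checkpoint file if this is a restart and the file is readable.
if [ "$RESTARTED" -eq 1 ] && [ -r "$ckpt_file" ]; then
    read ACTUAL_VALUE < "$ckpt_file"
    echo "Script restarted with value $ACTUAL_VALUE."
else
    ACTUAL_VALUE=1
    echo "Script started."
fi

# Shortened loop (5 steps, no sleep) for the demo.
while [ "$ACTUAL_VALUE" -le 5 ]; do
    echo "Processing $ACTUAL_VALUE."
    # Stand-in for the checkpoint signal Grid Engine would send.
    if [ "$ACTUAL_VALUE" -eq 3 ]; then
        kill -USR2 $$
    fi
    ACTUAL_VALUE=$((ACTUAL_VALUE + 1))
done

saved_value=$(cat "$ckpt_file")
echo "Checkpoint file holds $saved_value."
echo "Script finished."
```

With the backtick form, the echo runs once at definition time, which would explain a checkpoint_2 file that holds only the initial value.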

thanks barry

On 10/11/06 12:08 PM, Reuti wrote:
> Hi Barry,
> 
> On 11.10.2006 at 19:19, Barry McInnes wrote:
> 
>> Hi Reuti,
>>
>>
>> I am on example 2 now and can't see why the following happens.
>> As a note: to get example 1 operational I needed to replace the '
>> characters on the trap line with `
>> #!/bin/sh
>> # check_transparent1.sh
>>
>> #bjm add ` instead of '
>> trap `date >> $SGE_CKPT_DIR/checkpoint_1` usr2
> 
> this shouldn't work. The command would be executed at the time you
> define the trap command, not later on during execution.
> 
> On a Mac I get this:
> 
> reuti at defiant:~> trap `date >> test` usr2
> reuti at defiant:~> cat test
> Wed Oct 11 19:54:35 CEST 2006
> reuti at defiant:~> trap -p
> reuti at defiant:~>
> 
> There was never a usr2, and the file "test" is created during the
> definition of the trap. In fact, as the result of the `` is just an empty
> string, any trap set earlier would be removed by this command. An
> interactive test:
> 
> reuti at defiant:~> trap 'date' usr2
> reuti at defiant:~> trap -p
> trap -- 'date' SIGUSR2
> reuti at defiant:~> ps
>   PID  TT  STAT      TIME COMMAND
> 5588  p2  S      0:00.18 -bash
> reuti at defiant:~> kill -usr2 5588
> Wed Oct 11 20:00:36 CEST 2006
> reuti at defiant:~> kill -usr2 5588
> Wed Oct 11 20:00:48 CEST 2006
> reuti at defiant:~>
> 
> So before going to ex #2, #1 must work in the intended way.
> 
> What did you observe for #1 in detail? Did you wait at least 5 minutes to
> get the checkpoint event?
> 
> -- Reuti
> 
> 
>>
>> echo "Script started."
>>
>> for ((i=0; i<100; i++)) ; do
>>     sleep 1
>> done
>>
>> echo "Script finished."
>>
>> exit 0
>>
>>
>> So I have done the same for example 2, on the trap line
>> qsub -q low -ckpt check_transparent check_transparent2.sh
>>
>> The job runs for 5 minutes and then stops. The checkpoint file contained
>> just a space character, so I added the extra line setting ACTUAL_VALUE to 0;
>> now the output file ends up with 0 in it
>> [mac27:~/Checkpoint_Howto_Examples] bjm% ls -l
>> /usr/local/sge/checkpoint/checkpoint_2
>> -rw-r--r-- 1 bjm bin 2 Oct 11 10:14
>> /usr/local/sge/checkpoint/checkpoint_2
>> [mac27:~/Checkpoint_Howto_Examples] bjm% cat
>> /usr/local/sge/checkpoint/checkpoint_2
>> 0
>> [mac27:~/Checkpoint_Howto_Examples] bjm%
>>
>> the log file has at the end
>> Processing 294.
>> Processing 295.
>> Processing 296.
>>
>> So it looks like it's trapping, but it never writes 296 out to the
>> checkpoint_2 file?
>>
>> #!/bin/sh
>> # check_transparent2.sh
>>
>> #bjm
>> export ACTUAL_VALUE=0
>>
>> trap `echo $ACTUAL_VALUE > $SGE_CKPT_DIR/checkpoint_2` usr2
>>
>> #
>> # Check whether we are restarted and a checkpoint file is already available.
>> #
>>
>> if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r "$SGE_CKPT_DIR/checkpoint_2" ] ; then
>>     read ACTUAL_VALUE < $SGE_CKPT_DIR/checkpoint_2
>>     echo "Script restarted with value $ACTUAL_VALUE."
>> else
>>     ACTUAL_VALUE=1
>>     echo "Script started."
>> fi
>>
>> #
>> # Start of the program.
>> #
>>
>> while [ "$ACTUAL_VALUE" -le 1000 ] ; do
>>     echo "Processing $ACTUAL_VALUE."
>>     let ACTUAL_VALUE++
>>     sleep 1
>> done
>>
>> echo "Script finished."
>>
>> exit 0
>>
>>
>> My guess is that this is a Mac problem; I tried zsh as well as sh with the
>> same results.
>>
>> On 10/10/06 2:48 PM, Barry McInnes wrote:
>>> Thanks - that was the missing puzzle piece. First script works now, I
>>> had not read ahead where you have the parameter in the example.
>>> Onto the next tests...
>>>
>>> On 10/10/06 12:45 PM, Reuti wrote:
>>>> On 10.10.2006 at 20:29, Barry McInnes wrote:
>>>>
>>>>> I am going through the "Checkpointing of Serial Jobs" version 1.1a
>>>>> I am trying the transparent interface -
>>>>> created check_transparent via qmon, then
>>>>> qconf -mckpt check_transparent
>>>>> ckpt_name          check_transparent
>>>>> interface          TRANSPARENT
>>>>> ckpt_command       NONE
>>>>> migr_command       NONE
>>>>> restart_command    NONE
>>>>> clean_command      NONE
>>>>> ckpt_dir           /usr/local/sge/checkpoint
>>>>> signal             usr2
>>>>> when               xmr
>>>>> The low queue has time set
>>>>> qconf -sq low | grep cpu
>>>>> min_cpu_interval      00:05:00
>>>>> From qmon, I added check_transparent in the Checkpointing window of the
>>>>> low queue, but SGE_CKPT_DIR is never set. When I run
>>>>> check_transparent1.sh with env as the first line, there
>>>>> are no CKPT variables.
>>>>> Should something else be turned on ?
>>>>>
>>>> You included -ckpt check_transparent in the qsub command? - Reuti
>>>>
>>>>> thanks barry
>>>>>
>>>>> -----
>>>>> Barry McInnes
>>>>> 325 Broadway
>>>>> Boulder CO 80304
>>>>> (303)4976231
>>>>> barry.j.mcinnes at noaa.gov
>>>>> ---
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>
>>>
>>
>> -----
>> Barry McInnes
>> 325 Broadway
>> Boulder CO 80304
>> (303)4976231
>> barry.j.mcinnes at noaa.gov
>> ---
>>
> 
> 

-- 
---
Barry McInnes
325 Broadway
Boulder CO 80304
(303)4976231
barry.j.mcinnes at noaa.gov
---

