[GE users] Checkpointing on Mac Cluster ?

Barry McInnes Barry.J.Mcinnes at noaa.gov
Wed Oct 11 18:19:05 BST 2006


Hi Reuti,


I am on example 2 now and can't see why the following happens.
As a note: to get example 1 operational, I needed to replace the '
characters on the trap line with `:
#!/bin/sh
# check_transparent1.sh

#bjm add ` instead of '
trap `date >> $SGE_CKPT_DIR/checkpoint_1` usr2

echo "Script started."

for ((i=0; i<100; i++)) ; do
    sleep 1
done

echo "Script finished."

exit 0


So I have done the same on the trap line for example 2 and submitted it with
qsub -q low -ckpt check_transparent check_transparent2.sh

The job runs for 5 minutes and then stops. At first the checkpoint file
contained only a space character, so I added the extra line setting
ACTUAL_VALUE to 0; now the checkpoint file ends up with 0 in it:
[mac27:~/Checkpoint_Howto_Examples] bjm% ls -l /usr/local/sge/checkpoint/checkpoint_2
-rw-r--r-- 1 bjm bin 2 Oct 11 10:14 /usr/local/sge/checkpoint/checkpoint_2
[mac27:~/Checkpoint_Howto_Examples] bjm% cat /usr/local/sge/checkpoint/checkpoint_2
0
[mac27:~/Checkpoint_Howto_Examples] bjm%

The log file ends with:
Processing 294.
Processing 295.
Processing 296.

So it looks like it is trapping the signal, but it never writes 296 out
to the checkpoint_2 file?
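
A possible explanation for the 0, offered as a sketch rather than a
confirmed diagnosis: in sh, backticks are command substitution, so a
command inside them runs once, at the moment the trap line itself is
executed, and trap only receives whatever that command printed (here
nothing, because the output is redirected into the file). Single quotes
instead hand the command text to trap, to be run when the signal arrives.
Depending on the shell, the leftover "trap usr2" either resets the signal
to its default action or is rejected as a usage error; in both cases no
handler is installed, and the default action of USR2 is to terminate the
process, which would be consistent with the job stopping. The names
DEMO_DIR, VALUE and the two file names below are made up for the
demonstration and stand in for $SGE_CKPT_DIR and ACTUAL_VALUE.

```shell
#!/bin/sh
# Sketch only: contrasts a backtick "trap" with a single-quoted trap.
# DEMO_DIR and VALUE are hypothetical stand-ins, not part of the Howto.
DEMO_DIR=$(mktemp -d)
VALUE=0

# Backticks: the echo runs right now and writes the current VALUE (0).
# Its output is empty, so trap is left with only the signal name; some
# shells reject that, hence the error suppression.
trap `echo $VALUE > "$DEMO_DIR/backtick"` USR2 2>/dev/null || true
BACKTICK_RESULT=$(cat "$DEMO_DIR/backtick")

# Single quotes: the command text is stored and run when USR2 arrives,
# so it sees the value VALUE has at that time.
trap 'echo $VALUE > "$DEMO_DIR/quoted"' USR2
VALUE=296
kill -USR2 $$
QUOTED_RESULT=$(cat "$DEMO_DIR/quoted")

# Clean up the handler and the temporary directory.
trap - USR2
rm -rf "$DEMO_DIR"
echo "backticks wrote: $BACKTICK_RESULT, single quotes wrote: $QUOTED_RESULT"
```

If this reading is right, the backtick file already holds 0 before any
signal is sent, while the single-quoted trap writes the value that is
current when the signal is delivered.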

#!/bin/sh
# check_transparent2.sh

#bjm
export ACTUAL_VALUE=0

trap `echo $ACTUAL_VALUE > $SGE_CKPT_DIR/checkpoint_2` usr2

#
# Check whether we are restarted and a checkpoint file is already
# available.
#

if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r "$SGE_CKPT_DIR/checkpoint_2" ] ; then
    read ACTUAL_VALUE < $SGE_CKPT_DIR/checkpoint_2
    echo "Script restarted with value $ACTUAL_VALUE."
else
    ACTUAL_VALUE=1
    echo "Script started."
fi

#
# Start of the program.
#

while [ "$ACTUAL_VALUE" -le 1000 ] ; do
    echo "Processing $ACTUAL_VALUE."
    let ACTUAL_VALUE++
    sleep 1
done

echo "Script finished."

exit 0
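
For comparison, the save-on-signal and resume pattern that example 2 is
aiming for can be sketched in a self-contained form. This is a
simplified illustration under stated assumptions, not the Howto's
script: CKPT_FILE is a hypothetical temporary file standing in for
$SGE_CKPT_DIR/checkpoint_2, run_counter is a made-up helper, the sleep
is left out, and the script sends USR2 to itself instead of waiting for
Grid Engine to do so.

```shell
#!/bin/sh
# Sketch of a checkpoint-on-USR2 counter with a single-quoted trap.
# CKPT_FILE and run_counter are hypothetical names for this example.
CKPT_FILE=$(mktemp)

run_counter() {
    # Resume from the checkpoint if it holds a value, else start at 1.
    if [ -s "$CKPT_FILE" ]; then
        read -r COUNT < "$CKPT_FILE"
    else
        COUNT=1
    fi
    # Single quotes: $COUNT is expanded when USR2 arrives, not here.
    trap 'echo "$COUNT" > "$CKPT_FILE"' USR2
    LIMIT=$1
    while [ "$COUNT" -le "$LIMIT" ]; do
        COUNT=$((COUNT + 1))
    done
}

run_counter 5            # first run, starts from 1
kill -USR2 $$            # simulate the checkpoint signal
SAVED=$(cat "$CKPT_FILE")
run_counter 8            # "restart": picks up the saved value

# Clean up the handler and the temporary file.
trap - USR2
rm -f "$CKPT_FILE"
echo "saved at signal: $SAVED, final count: $COUNT"
```

In a real job the signal would arrive during the loop rather than
between two runs; the sketch only shows that the value written to the
checkpoint file is the one current at signal time, and that a later run
resumes from it.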


My guess is that this is a Mac problem; I tried zsh as well as sh, with
the same results.

On 10/10/06 2:48 PM, Barry McInnes wrote:
> Thanks - that was the missing puzzle piece. First script works now, I
> had not read ahead where you have the parameter in the example.
> Onto the next tests...
> 
> On 10/10/06 12:45 PM, Reuti wrote:
>> On 10.10.2006 at 20:29, Barry McInnes wrote:
>>
>>> I am going through the "Checkpointing of Serial Jobs" version 1.1a
>>> I am trying the transparent interface -
>>> created check_transparent via qmon, then
>>> qconf -mckpt check_transparent
>>> ckpt_name          check_transparent
>>> interface          TRANSPARENT
>>> ckpt_command       NONE
>>> migr_command       NONE
>>> restart_command    NONE
>>> clean_command      NONE
>>> ckpt_dir           /usr/local/sge/checkpoint
>>> signal             usr2
>>> when               xmr
>>> The low queue has time set
>>> qconf -sq low | grep cpu
>>> min_cpu_interval      00:05:00
>>> From qmon, I added check_transparent in the Checkpointing window of
>>> the low queue, but SGE_CKPT_DIR is never set. When I run
>>> check_transparent1.sh with env as the first line, there are no
>>> CKPT variables.
>>> Should something else be turned on?
>>>
>> You included -ckpt check_transparent in the qsub command? - Reuti
>>
>>> thanks barry
>>>
>>> -- 
>>> ---
>>> Barry McInnes
>>> 325 Broadway
>>> Boulder CO 80304
>>> (303)4976231
>>> barry.j.mcinnes at noaa.gov
>>> ---
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
> 

-- 
---
Barry McInnes
325 Broadway
Boulder CO 80304
(303)4976231
barry.j.mcinnes at noaa.gov
---
