[GE users] Checkpointing on Mac Cluster ?

Reuti reuti at staff.uni-marburg.de
Wed Oct 11 19:08:06 BST 2006


Hi Barry,

On 11.10.2006 at 19:19, Barry McInnes wrote:

> Hi Reuti,
>
>
> I am on example 2 now and can't see why the following happens.
> As a note: to get example 1 operational I needed to replace the
> ' characters on the trap line with `
> #!/bin/sh
> # check_transparent1.sh
>
> #bjm add ` instead of '
> trap `date >> $SGE_CKPT_DIR/checkpoint_1` usr2

This shouldn't work. With backticks, the command is executed at the time
you define the trap, not later on when the signal arrives.

On a Mac I get this:

reuti@defiant:~> trap `date >> test` usr2
reuti@defiant:~> cat test
Wed Oct 11 19:54:35 CEST 2006
reuti@defiant:~> trap -p
reuti@defiant:~>

There was never a usr2 signal, and the file "test" is created during the
definition of the trap. In fact, as the result of the `` is just an
empty string, any trap set before would be removed by this command.
An interactive test:

reuti@defiant:~> trap 'date' usr2
reuti@defiant:~> trap -p
trap -- 'date' SIGUSR2
reuti@defiant:~> ps
  PID  TT  STAT      TIME COMMAND
 5588  p2  S      0:00.18 -bash
reuti@defiant:~> kill -usr2 5588
Wed Oct 11 20:00:36 CEST 2006
reuti@defiant:~> kill -usr2 5588
Wed Oct 11 20:00:48 CEST 2006
reuti@defiant:~>
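
The same comparison can also be run as a script instead of interactively.
This is only a sketch: the temporary directory and the *.log file names are
made up for illustration, and it assumes that mktemp -d is available.

```shell
#!/bin/sh
# Sketch only: workdir and the *.log names are hypothetical.
workdir=$(mktemp -d)

# Backticks: "date" runs right here, while the trap line is being read.
# Its output goes to the file, so the substitution expands to nothing and
# the line ends up as "trap USR2", which resets any earlier USR2 handler.
trap `date >> "$workdir/backtick.log"` USR2
backtick_ran_at_definition=no
[ -s "$workdir/backtick.log" ] && backtick_ran_at_definition=yes

# Single quotes: the text is only stored; nothing runs yet.
trap 'date >> "$workdir/quoted.log"' USR2
quoted_ran_at_definition=no
[ -s "$workdir/quoted.log" ] && quoted_ran_at_definition=yes

# Deliver the signal to this shell; the stored command runs now.
kill -USR2 $$
quoted_ran_on_signal=no
[ -s "$workdir/quoted.log" ] && quoted_ran_on_signal=yes

echo "backticks ran at definition:     $backtick_ran_at_definition"
echo "single quotes ran at definition: $quoted_ran_at_definition"
echo "single quotes ran on USR2:       $quoted_ran_on_signal"
```

Note that the script never sends USR2 while the backtick version is in
effect: with the handler reset to the default, the signal would simply
terminate the shell.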

So before going to ex #2, #1 must work in the intended way.

What did you observe for #1 in detail? Did you wait at least 5 minutes
to get the checkpoint event?
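
Presumably the same quoting problem is behind the 0 in checkpoint_2: with
backticks the echo runs only once, when the trap line is read, and at that
point ACTUAL_VALUE is still 0. Below is a sketch with single quotes, where
the variable is expanded only when the signal arrives. CKPT_FILE is a
stand-in for $SGE_CKPT_DIR/checkpoint_2, and the loop is shortened and
signals itself so that it can be tried outside of SGE; none of this is
taken from the Howto.

```shell
#!/bin/sh
# Sketch only: CKPT_FILE stands in for $SGE_CKPT_DIR/checkpoint_2.
CKPT_FILE=$(mktemp)
ACTUAL_VALUE=0

# Single quotes: $ACTUAL_VALUE is expanded when USR2 arrives, not here.
trap 'echo $ACTUAL_VALUE > "$CKPT_FILE"' USR2

while [ "$ACTUAL_VALUE" -lt 5 ] ; do
    ACTUAL_VALUE=$((ACTUAL_VALUE + 1))
    # Simulate the checkpoint signal from SGE at step 3.
    [ "$ACTUAL_VALUE" -eq 3 ] && kill -USR2 $$
done

# Read back what the trap handler wrote.
read SAVED_VALUE < "$CKPT_FILE"
echo "Checkpoint file holds $SAVED_VALUE."
```

In the real job the signal would come from SGE rather than from the script
itself, so the saved value depends on when the checkpoint event happens.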

-- Reuti


>
> echo "Script started."
>
> for ((i=0; i<100; i++)) ; do
>     sleep 1
> done
>
> echo "Script finished."
>
> exit 0
>
>
> So I have done the same for example 2, on the trap line
> qsub -q low -ckpt check_transparent check_transparent2.sh
>
> The job runs for 5 mins then stops. The checkpoint file contained a
> space character, so I added the extra line setting ACTUAL_VALUE to 0;
> now the output file ends up with 0 in it
> [mac27:~/Checkpoint_Howto_Examples] bjm% ls -l /usr/local/sge/checkpoint/checkpoint_2
> -rw-r--r-- 1 bjm bin 2 Oct 11 10:14 /usr/local/sge/checkpoint/checkpoint_2
> [mac27:~/Checkpoint_Howto_Examples] bjm% cat /usr/local/sge/checkpoint/checkpoint_2
> 0
> [mac27:~/Checkpoint_Howto_Examples] bjm%
>
> the log file has at the end
> Processing 294.
> Processing 295.
> Processing 296.
>
> So it looks like it's trapping, but it never writes out 296 to the
> checkpoint_2 file?
>
> #!/bin/sh
> # check_transparent2.sh
>
> #bjm
> export ACTUAL_VALUE=0
>
> trap `echo $ACTUAL_VALUE > $SGE_CKPT_DIR/checkpoint_2` usr2
>
> #
> # Check whether we are restarted and a checkpoint file is already
> # available.
> #
>
> if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r \
>      "$SGE_CKPT_DIR/checkpoint_2" ] ; then
>     read ACTUAL_VALUE < $SGE_CKPT_DIR/checkpoint_2
>     echo "Script restarted with value $ACTUAL_VALUE."
> else
>     ACTUAL_VALUE=1
>     echo "Script started."
> fi
>
> #
> # Start of the program.
> #
>
> while [ "$ACTUAL_VALUE" -le 1000 ] ; do
>     echo "Processing $ACTUAL_VALUE."
>     let ACTUAL_VALUE++
>     sleep 1
> done
>
> echo "Script finished."
>
> exit 0
>
>
> My guess is this is a Mac problem; I tried zsh as well as sh, with the
> same results.
>
> On 10/10/06 2:48 PM, Barry McInnes wrote:
>> Thanks - that was the missing puzzle piece. First script works now, I
>> had not read ahead where you have the parameter in the example.
>> Onto the next tests...
>>
>> On 10/10/06 12:45 PM, Reuti wrote:
>>> On 10.10.2006 at 20:29, Barry McInnes wrote:
>>>
>>>> I am going through the "Checkpointing of Serial Jobs" version 1.1a
>>>> I am trying the transparent interface -
>>>> created check_transparent via qmon, then
>>>> qconf -mckpt check_transparent
>>>> ckpt_name          check_transparent
>>>> interface          TRANSPARENT
>>>> ckpt_command       NONE
>>>> migr_command       NONE
>>>> restart_command    NONE
>>>> clean_command      NONE
>>>> ckpt_dir           /usr/local/sge/checkpoint
>>>> signal             usr2
>>>> when               xmr
>>>> The low queue has time set
>>>> qconf -sq low | grep cpu
>>>> min_cpu_interval      00:05:00
>>>> From qmon, I added check_transparent in the Checkpointing window
>>>> of the low queue,
>>>> but SGE_CKPT_DIR is never set. When I run
>>>> check_transparent1.sh and put env as the first line, there
>>>> are no CKPT variables.
>>>> Should something else be turned on?
>>>>
>>> You included -ckpt check_transparent in the qsub command? - Reuti
>>>
>>>> thanks barry
>>>>
>>>> -- 
>>>> ---
>>>> Barry McInnes
>>>> 325 Broadway
>>>> Boulder CO 80304
>>>> (303)4976231
>>>> barry.j.mcinnes at noaa.gov
>>>> ---
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>
>
>




