[GE users] Checkpointing on Mac Cluster ?

Barry McInnes Barry.J.Mcinnes at noaa.gov
Thu Oct 12 17:04:15 BST 2006



After the interactive tests worked, I put everything back to the original,
and now example 1 works:
[mac27:~/Checkpoint_Howto_Examples] bjm% cat /usr/local/sge/checkpoint/checkpoint_1
Tue Oct 10 14:12:53 MDT 2006
Tue Oct 10 14:39:53 MDT 2006
Tue Oct 10 15:11:39 MDT 2006
Wed Oct 11 09:40:39 MDT 2006
Thu Oct 12 09:34:14 MDT 2006
Thu Oct 12 09:43:28 MDT 2006
Thu Oct 12 09:48:28 MDT 2006
Thu Oct 12 09:53:28 MDT 2006
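
A minimal sketch of the quoting difference discussed in the messages quoted
below, assuming a POSIX sh. It is not taken from the Howto; the scratch
directory, the file names and the variables demo_dir, backtick_lines and
quoted_lines are made up for illustration. The child script defines one trap
with backticks and one with single quotes, then sends itself USR2 twice:

```shell
# Scratch directory for the demo; every path below is hypothetical.
demo_dir=$(mktemp -d)

# Child script: one trap defined with backticks, one with single quotes.
cat > "$demo_dir/trap_demo.sh" <<'EOF'
#!/bin/sh
log_dir=$1

# Backticks: the substitution runs right now, while this line is parsed.
# Its output is empty, so what remains is "trap USR2", which only resets
# USR2 to its default action. No signal is sent while this is in effect.
trap `date >> "$log_dir/backtick.log"` USR2

# Single quotes: the command string is stored and runs on each USR2.
trap 'date >> "$log_dir/quoted.log"' USR2
kill -s USR2 "$$"
kill -s USR2 "$$"
# No-op, so the shell reaches another command boundary before exiting.
:
EOF

sh "$demo_dir/trap_demo.sh" "$demo_dir"

# tr strips the padding that some wc implementations add.
backtick_lines=$(wc -l < "$demo_dir/backtick.log" | tr -d ' ')
quoted_lines=$(wc -l < "$demo_dir/quoted.log" | tr -d ' ')
echo "backtick.log: $backtick_lines line(s), written while the trap was defined"
echo "quoted.log: $quoted_lines line(s), written when USR2 arrived"
```

The backtick log gets its line before any signal exists, while the quoted log
grows only when USR2 is delivered, which is the behaviour the checkpoint
examples rely on.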

On 10/11/06 3:56 PM, Reuti wrote:
> On 11.10.2006 at 23:36, Barry McInnes wrote:
> 
>> It has me stumped - it only works with a "`" character?
>> Using " or ' writes nothing to the checkpoint_1 file, while
>> trap `date ...` usr2
>> works even though it shouldn't.
> 
> Before we go deeper into this: can you please type my interactive
> examples from my last post, i.e. listed below, and post the results here?
> 
> Thx - Reuti
> 
> 
>> I checked other Mac OS X scripts with trap and they use the ' character.
>> I have tried using 31 instead of usr2 and USR2, with no difference; it
>> doesn't work.
>>
>> Is there anything else I can try or set ?
>>
>> thanks barry
>>
>> On 10/11/06 12:08 PM, Reuti wrote:
>>> Hi Barry,
>>>
>>> On 11.10.2006 at 19:19, Barry McInnes wrote:
>>>
>>>> Hi Reuti,
>>>>
>>>>
>>>> I am on example 2 now and can't see why the following happens.
>>>> As a note: to get example 1 operational I needed to replace the
>>>> ' characters on the trap line with `
>>>> #!/bin/sh
>>>> # check_transparent1.sh
>>>>
>>>> #bjm add ` instead of '
>>>> trap `date >> $SGE_CKPT_DIR/checkpoint_1` usr2
>>>
>>> This shouldn't work. With backticks, the command is executed at the time
>>> you define the trap, not later on when the signal arrives.
>>>
>>> On a Mac I get this:
>>>
>>> reuti at defiant:~> trap `date >> test` usr2
>>> reuti at defiant:~> cat test
>>> Wed Oct 11 19:54:35 CEST 2006
>>> reuti at defiant:~> trap -p
>>> reuti at defiant:~>
>>>
>>> There was never a usr2, and the file "test" is created during the
>>> definition of the trap. In fact, as the result of the `` is just an empty
>>> string, any trap set before would be removed by this command. An
>>> interactive test:
>>>
>>> reuti at defiant:~> trap 'date' usr2
>>> reuti at defiant:~> trap -p
>>> trap -- 'date' SIGUSR2
>>> reuti at defiant:~> ps
>>>   PID  TT  STAT      TIME COMMAND
>>> 5588  p2  S      0:00.18 -bash
>>> reuti at defiant:~> kill -usr2 5588
>>> Wed Oct 11 20:00:36 CEST 2006
>>> reuti at defiant:~> kill -usr2 5588
>>> Wed Oct 11 20:00:48 CEST 2006
>>> reuti at defiant:~>
>>>
>>> So before going to ex #2, #1 must work in the intended way.
>>>
>>> What did you observe for #1 in detail? Did you wait at least 5 minutes to
>>> get the checkpoint event?
>>>
>>> -- Reuti
>>>
>>>
>>>>
>>>> echo "Script started."
>>>>
>>>> for ((i=0; i<100; i++)) ; do
>>>>     sleep 1
>>>> done
>>>>
>>>> echo "Script finished."
>>>>
>>>> exit 0
>>>>
>>>>
>>>> So I have done the same for example 2, on the trap line:
>>>> qsub -q low -ckpt check_transparent check_transparent2.sh
>>>>
>>>> The job runs for 5 minutes and then stops. The checkpoint file contained
>>>> a space character, so I added an extra line setting ACTUAL_VALUE to 0;
>>>> now the checkpoint file ends up with 0 in it:
>>>> [mac27:~/Checkpoint_Howto_Examples] bjm% ls -l /usr/local/sge/checkpoint/checkpoint_2
>>>> -rw-r--r-- 1 bjm bin 2 Oct 11 10:14 /usr/local/sge/checkpoint/checkpoint_2
>>>> [mac27:~/Checkpoint_Howto_Examples] bjm% cat /usr/local/sge/checkpoint/checkpoint_2
>>>> 0
>>>> [mac27:~/Checkpoint_Howto_Examples] bjm%
>>>>
>>>> the log file has at the end
>>>> Processing 294.
>>>> Processing 295.
>>>> Processing 296.
>>>>
>>>> So it looks like it's trapping, but it never writes out 296 to the
>>>> checkpoint_2 file?
>>>>
>>>> #!/bin/sh
>>>> # check_transparent2.sh
>>>>
>>>> #bjm
>>>> export ACTUAL_VALUE=0
>>>>
>>>> trap `echo $ACTUAL_VALUE > $SGE_CKPT_DIR/checkpoint_2` usr2
>>>>
>>>> #
>>>> # Check whether we are restarted and a checkpoint file is already available.
>>>> #
>>>>
>>>> if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r "$SGE_CKPT_DIR/checkpoint_2" ] ; then
>>>>     read ACTUAL_VALUE < $SGE_CKPT_DIR/checkpoint_2
>>>>     echo "Script restarted with value $ACTUAL_VALUE."
>>>> else
>>>>     ACTUAL_VALUE=1
>>>>     echo "Script started."
>>>> fi
>>>>
>>>> #
>>>> # Start of the program.
>>>> #
>>>>
>>>> while [ "$ACTUAL_VALUE" -le 1000 ] ; do
>>>>     echo "Processing $ACTUAL_VALUE."
>>>>     let ACTUAL_VALUE++
>>>>     sleep 1
>>>> done
>>>>
>>>> echo "Script finished."
>>>>
>>>> exit 0
>>>>
>>>>
>>>> My guess is this is a Mac problem; I tried zsh as well as sh with the
>>>> same results.
>>>>
>>>> On 10/10/06 2:48 PM, Barry McInnes wrote:
>>>>> Thanks - that was the missing puzzle piece. First script works now, I
>>>>> had not read ahead where you have the parameter in the example.
>>>>> Onto the next tests...
>>>>>
>>>>> On 10/10/06 12:45 PM, Reuti wrote:
>>>>>> On 10.10.2006 at 20:29, Barry McInnes wrote:
>>>>>>
>>>>>>> I am going through the "Checkpointing of Serial Jobs" version 1.1a
>>>>>>> I am trying the transparent interface -
>>>>>>> created check_transparent via qmon, then
>>>>>>> qconf -mckpt check_transparent
>>>>>>> ckpt_name          check_transparent
>>>>>>> interface          TRANSPARENT
>>>>>>> ckpt_command       NONE
>>>>>>> migr_command       NONE
>>>>>>> restart_command    NONE
>>>>>>> clean_command      NONE
>>>>>>> ckpt_dir           /usr/local/sge/checkpoint
>>>>>>> signal             usr2
>>>>>>> when               xmr
>>>>>>> The low queue has the interval set:
>>>>>>> qconf -sq low | grep cpu
>>>>>>> min_cpu_interval      00:05:00
>>>>>>> From qmon, I added check_transparent in the low queue's Checkpointing
>>>>>>> window,
>>>>>>> but SGE_CKPT_DIR is never set. When I run
>>>>>>> check_transparent1.sh with env as the first line, there
>>>>>>> are no CKPT variables.
>>>>>>> Should something else be turned on?
>>>>>>>
>>>>>> You included -ckpt check_transparent in the qsub command? - Reuti
>>>>>>
>>>>>>> thanks barry
>>>>>>>
>>>>>>> -----
>>>>>>> Barry McInnes
>>>>>>> 325 Broadway
>>>>>>> Boulder CO 80304
>>>>>>> (303)4976231
>>>>>>> barry.j.mcinnes at noaa.gov
>>>>>>> ---
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>>
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
> 
> 
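
A hedged sketch of how the example 2 script quoted above might look with a
single-quoted trap, runnable outside SGE. Assumptions: the loop is shortened
to 5, the script signals itself at value 3 in place of the checkpoint signal,
a scratch directory stands in for SGE_CKPT_DIR, RESTARTED is set by hand, and
plain sh arithmetic replaces "let". The names ckpt_demo_dir, first_run,
saved_value and second_run are illustrative only:

```shell
# Scratch checkpoint directory standing in for $SGE_CKPT_DIR (hypothetical).
ckpt_demo_dir=$(mktemp -d)

cat > "$ckpt_demo_dir/check_transparent2_sketch.sh" <<'EOF'
#!/bin/sh
# Single quotes: $ACTUAL_VALUE is expanded when USR2 arrives, not here.
trap 'echo "$ACTUAL_VALUE" > "$SGE_CKPT_DIR/checkpoint_2"' USR2

# Resume from the checkpoint file if this is a restart and the file is readable.
if [ "${RESTARTED:-0}" -eq 1 ] && [ -r "$SGE_CKPT_DIR/checkpoint_2" ]; then
    read -r ACTUAL_VALUE < "$SGE_CKPT_DIR/checkpoint_2"
    echo "Script restarted with value $ACTUAL_VALUE."
else
    ACTUAL_VALUE=1
    echo "Script started."
fi

while [ "$ACTUAL_VALUE" -le 5 ]; do
    echo "Processing $ACTUAL_VALUE."
    # Demo only: on the first run, signal ourselves once at value 3.
    if [ "$ACTUAL_VALUE" -eq 3 ] && [ "${RESTARTED:-0}" -ne 1 ]; then
        kill -s USR2 "$$"
    fi
    ACTUAL_VALUE=$((ACTUAL_VALUE + 1))
done

echo "Script finished."
EOF

# First run: the handler stores the counter when USR2 arrives.
first_run=$(SGE_CKPT_DIR=$ckpt_demo_dir RESTARTED=0 \
    sh "$ckpt_demo_dir/check_transparent2_sketch.sh")
saved_value=$(cat "$ckpt_demo_dir/checkpoint_2")

# Second run: pretend the job was restarted and picks up the stored value.
second_run=$(SGE_CKPT_DIR=$ckpt_demo_dir RESTARTED=1 \
    sh "$ckpt_demo_dir/check_transparent2_sketch.sh")

echo "checkpoint file holds: $saved_value"
echo "$second_run"
```

With backticks on the trap line, the echo would instead run once at
definition time, before ACTUAL_VALUE has a useful value, which fits the blank
and "0" checkpoint files reported above.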

-- 
---
Barry McInnes
325 Broadway
Boulder CO 80304
(303)4976231
barry.j.mcinnes at noaa.gov
---
