[GE users] Checkpointing on Mac Cluster ?

Barry McInnes Barry.J.Mcinnes at noaa.gov
Fri Oct 13 21:51:07 BST 2006


I think the flakiness was due to the fact that there were two check jobs in
"Eqw" state - after I deleted them, check 1 and 2 work fine.
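For anyone hitting the same symptom, here is a minimal sketch of how stuck jobs could be picked out of a qstat listing. It assumes the default qstat layout with the job state in the fifth column. The helper name list_eqw_jobs and the sample listing are made up for illustration; they are not output from this cluster.

```shell
#!/bin/sh
# list_eqw_jobs: hypothetical helper, prints the job IDs whose state is Eqw.
# Assumption: default qstat layout, with the state in the fifth column.
list_eqw_jobs() {
    awk '$5 == "Eqw" { print $1 }'
}

# Made-up sample in the assumed layout (not real output from this thread).
sample='job-ID  prior   name       user  state submit/start at     queue     slots
--------------------------------------------------------------------------
    101 0.55500 check_tran bjm   Eqw   10/10/2006 14:05:11               1
    102 0.55500 check_tran bjm   Eqw   10/10/2006 14:20:42               1
    103 0.55500 check_tran bjm   r     10/12/2006 09:29:14 low@mac27     1'

# Collect the matching IDs on one line, separated by spaces.
eqw_ids=$(printf '%s\n' "$sample" | list_eqw_jobs | xargs)
echo "Jobs in Eqw: $eqw_ids"
```

On a live system the same filter could be fed from qstat directly, for example qstat -u bjm | list_eqw_jobs. After that, qstat -j <job-ID> shows the error reason and qdel <job-ID> removes the job.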

On 10/12/06 10:04 AM, Barry McInnes wrote:
> After the interactive stuff worked, I put everything back to original,
> and now it works for example_1 !
> [mac27:~/Checkpoint_Howto_Examples] bjm% cat
> /usr/local/sge/checkpoint/checkpoint_1
> Tue Oct 10 14:12:53 MDT 2006
> Tue Oct 10 14:39:53 MDT 2006
> Tue Oct 10 15:11:39 MDT 2006
> Wed Oct 11 09:40:39 MDT 2006
> Thu Oct 12 09:34:14 MDT 2006
> Thu Oct 12 09:43:28 MDT 2006
> Thu Oct 12 09:48:28 MDT 2006
> Thu Oct 12 09:53:28 MDT 2006
> 
> On 10/11/06 3:56 PM, Reuti wrote:
>> Am 11.10.2006 um 23:36 schrieb Barry McInnes:
>>
>>> It got me stumped - it only works with a "`" character ?
>>> A " or ' do nothing to the checkpoint_1 file output, using
>>> trap `date ...` usr2
>>> works even though it shouldn't.
>> Before we go deeper into this: can you please type my interactive
>> examples from my last post, i.e. listed below, and post the results here?
>>
>> Thx - Reuti
>>
>>
>>> I checked other Mac OS X scripts with trap and they use the ' character.
>>> I have tried using 31 instead of usr2 and USR2, with no difference; it
>>> doesn't work.
>>>
>>> Is there anything else I can try or set ?
>>>
>>> thanks barry
>>>
>>> On 10/11/06 12:08 PM, Reuti wrote:
>>>> Hi Barry,
>>>>
>>>> Am 11.10.2006 um 19:19 schrieb Barry McInnes:
>>>>
>>>>> Hi Reuti,
>>>>>
>>>>>
>>>>> I am on example 2 now and can't see why the following happens.
>>>>> As a note: to get example 1 operational I needed to replace the '
>>>>> characters on the trap line with `
>>>>> #!/bin/sh
>>>>> # check_transparent1.sh
>>>>>
>>>>> #bjm add ` instead of '
>>>>> trap `date >> $SGE_CKPT_DIR/checkpoint_1` usr2
>>>> this shouldn't work. The command would be executed at the time you
>>>> define the trap command, not later on during execution.
>>>>
>>>> On a Mac I get this:
>>>>
>>>> reuti at defiant:~> trap `date >> test` usr2
>>>> reuti at defiant:~> cat test
>>>> Wed Oct 11 19:54:35 CEST 2006
>>>> reuti at defiant:~> trap -p
>>>> reuti at defiant:~>
>>>>
>>>> There was never a usr2, and the file "test" is created during the
>>>> definition of the trap. In fact, as the result of the `` is just an empty
>>>> string, any trap set before would be removed by this command. An
>>>> interactive test:
>>>>
>>>> reuti at defiant:~> trap 'date' usr2
>>>> reuti at defiant:~> trap -p
>>>> trap -- 'date' SIGUSR2
>>>> reuti at defiant:~> ps
>>>>   PID  TT  STAT      TIME COMMAND
>>>> 5588  p2  S      0:00.18 -bash
>>>> reuti at defiant:~> kill -usr2 5588
>>>> Wed Oct 11 20:00:36 CEST 2006
>>>> reuti at defiant:~> kill -usr2 5588
>>>> Wed Oct 11 20:00:48 CEST 2006
>>>> reuti at defiant:~>
>>>>
>>>> So before going to ex #2, #1 must work in the intended way.
>>>>
>>>> What did you observe for #1 in detail? Did you wait at least 5 minutes to
>>>> get the checkpoint event?
>>>>
>>>> -- Reuti
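The difference Reuti describes can also be reproduced in a script. The following is only a sketch: the scratch file names are hypothetical, and the script signals itself as a stand-in for the usr2 that Grid Engine would send.

```shell
#!/bin/sh
# Sketch only: workdir, bt_log and sq_log are hypothetical scratch names.
workdir=$(mktemp -d)
bt_log="$workdir/backtick.log"
sq_log="$workdir/quoted.log"
: > "$sq_log"

# Backticks: the command substitution runs immediately, while this line is
# evaluated. Its output is empty, so the shell only sees "trap USR2",
# which installs no handler.
trap `echo fired >> "$bt_log"` USR2
bt_count=$(grep -c fired "$bt_log")      # lines written without any signal
trap > "$workdir/traps.txt"              # list the traps currently set
bt_handlers=$(grep -c USR2 "$workdir/traps.txt")

# Single quotes: the string is stored and evaluated each time USR2 arrives.
trap 'echo fired >> "$sq_log"' USR2
sq_before=$(grep -c fired "$sq_log")     # lines written before any signal
kill -USR2 $$                            # stand-in for the checkpoint signal
kill -USR2 $$
sq_after=$(grep -c fired "$sq_log")      # lines written after two signals
trap - USR2                              # restore the default disposition

echo "backticks:     $bt_count line(s) at definition, $bt_handlers handler(s)"
echo "single quotes: $sq_before line(s) before, $sq_after after two signals"
```

The backtick variant is deliberately never signalled here, because with no handler installed USR2 would terminate the script.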
>>>>
>>>>
>>>>> echo "Script started."
>>>>>
>>>>> for ((i=0; i<100; i++)) ; do
>>>>>     sleep 1
>>>>> done
>>>>>
>>>>> echo "Script finished."
>>>>>
>>>>> exit 0
>>>>>
>>>>>
>>>>> So I have done the same for example 2, on the trap line
>>>>> qsub -q low -ckpt check_transparent check_transparent2.sh
>>>>>
>>>>> The job runs for 5 mins then stops. The checkpoint file contained only a
>>>>> space character, so I added the extra line setting ACTUAL_VALUE to 0; now
>>>>> the output file ends up with 0 in it
>>>>> [mac27:~/Checkpoint_Howto_Examples] bjm% ls -l
>>>>> /usr/local/sge/checkpoint/checkpoint_2
>>>>> -rw-r--r-- 1 bjm bin 2 Oct 11 10:14
>>>>> /usr/local/sge/checkpoint/checkpoint_2
>>>>> [mac27:~/Checkpoint_Howto_Examples] bjm% cat
>>>>> /usr/local/sge/checkpoint/checkpoint_2
>>>>> 0
>>>>> [mac27:~/Checkpoint_Howto_Examples] bjm%
>>>>>
>>>>> the log file has at the end
>>>>> Processing 294.
>>>>> Processing 295.
>>>>> Processing 296.
>>>>>
>>>>> So it looks like it's trapping, but never writes out 296 to the
>>>>> checkpoint_2 file ?
>>>>>
>>>>> #!/bin/sh
>>>>> # check_transparent2.sh
>>>>>
>>>>> #bjm
>>>>> export ACTUAL_VALUE=0
>>>>>
>>>>> trap `echo $ACTUAL_VALUE > $SGE_CKPT_DIR/checkpoint_2` usr2
>>>>>
>>>>> #
>>>>> # Check whether we are restarted and a checkpoint file is already
>>>>> # available.
>>>>> #
>>>>>
>>>>> if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r
>>>>> "$SGE_CKPT_DIR/checkpoint_2" ] ; then
>>>>>     read ACTUAL_VALUE < $SGE_CKPT_DIR/checkpoint_2
>>>>>     echo "Script restarted with value $ACTUAL_VALUE."
>>>>> else
>>>>>     ACTUAL_VALUE=1
>>>>>     echo "Script started."
>>>>> fi
>>>>>
>>>>> #
>>>>> # Start of the program.
>>>>> #
>>>>>
>>>>> while [ "$ACTUAL_VALUE" -le 1000 ] ; do
>>>>>     echo "Processing $ACTUAL_VALUE."
>>>>>     let ACTUAL_VALUE++
>>>>>     sleep 1
>>>>> done
>>>>>
>>>>> echo "Script finished."
>>>>>
>>>>> exit 0
>>>>>
>>>>>
>>>>> My guess is this is a Mac problem; I tried zsh as well as sh with the
>>>>> same results.
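With backticks on the trap line, the echo runs once when that line is evaluated, so checkpoint_2 holds whatever ACTUAL_VALUE was at that moment (0 after the added export) and is never updated later. Below is a shortened sketch of the same pattern with a single-quoted trap. Several parts are assumptions added so it runs quickly outside Grid Engine: the fallback for SGE_CKPT_DIR, the LIMIT and SIGNAL_AT knobs, and the script signalling itself in place of the real checkpoint signal.

```shell
#!/bin/sh
# Sketch of the example 2 pattern with deferred expansion in the trap.
# Assumptions: outside Grid Engine SGE_CKPT_DIR falls back to a scratch
# directory; LIMIT and SIGNAL_AT are hypothetical knobs for a short demo.
SGE_CKPT_DIR=${SGE_CKPT_DIR:-$(mktemp -d)}
CKPT_FILE="$SGE_CKPT_DIR/checkpoint_2"
LIMIT=${LIMIT:-10}
SIGNAL_AT=${SIGNAL_AT:-7}

# Single quotes: $ACTUAL_VALUE is expanded when USR2 arrives, not now.
trap 'echo "$ACTUAL_VALUE" > "$CKPT_FILE"' USR2

# Resume from the checkpoint file if this is a restart and the file is readable.
if [ "${RESTARTED:-0}" -eq 1 ] && [ -r "$CKPT_FILE" ]; then
    read -r ACTUAL_VALUE < "$CKPT_FILE"
    echo "Script restarted with value $ACTUAL_VALUE."
else
    ACTUAL_VALUE=1
    echo "Script started."
fi

while [ "$ACTUAL_VALUE" -le "$LIMIT" ]; do
    echo "Processing $ACTUAL_VALUE."
    # Stand-in for the checkpoint signal that Grid Engine would send.
    if [ "$ACTUAL_VALUE" -eq "$SIGNAL_AT" ]; then
        kill -USR2 $$
    fi
    ACTUAL_VALUE=$((ACTUAL_VALUE + 1))
done

echo "Script finished."
saved_value=$(cat "$CKPT_FILE")
echo "Checkpoint file contains: $saved_value"
```

The arithmetic expansion replaces let ACTUAL_VALUE++ so that the sketch does not depend on bash extensions when it is started as /bin/sh.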
>>>>>
>>>>> On 10/10/06 2:48 PM, Barry McInnes wrote:
>>>>>> Thanks - that was the missing puzzle piece. First script works now, I
>>>>>> had not read ahead where you have the parameter in the example.
>>>>>> Onto the next tests...
>>>>>>
>>>>>> On 10/10/06 12:45 PM, Reuti wrote:
>>>>>>> Am 10.10.2006 um 20:29 schrieb Barry McInnes:
>>>>>>>
>>>>>>>> I am going through the "Checkpointing of Serial Jobs" version 1.1a
>>>>>>>> I am trying the transparent interface -
>>>>>>>> created check_transparent via qmon, then
>>>>>>>> qconf -mckpt check_transparent
>>>>>>>> ckpt_name          check_transparent
>>>>>>>> interface          TRANSPARENT
>>>>>>>> ckpt_command       NONE
>>>>>>>> migr_command       NONE
>>>>>>>> restart_command    NONE
>>>>>>>> clean_command      NONE
>>>>>>>> ckpt_dir           /usr/local/sge/checkpoint
>>>>>>>> signal             usr2
>>>>>>>> when               xmr
>>>>>>>> The low queue has time set
>>>>>>>> qconf -sq low | grep cpu
>>>>>>>> min_cpu_interval      00:05:00
>>>>>>>> From qmon, I added the check_transparent in low Checkpointing
>>>>>>>> window
>>>>>>>> but SGE_CKPT_DIR is never set, when I run
>>>>>>>> check_transparent1.sh, and put env as the first line, there
>>>>>>>> is no CKPT variables.
>>>>>>>> Should something else be turned on ?
>>>>>>>>
>>>>>>> You included -ckpt check_transparent in the qsub command? - Reuti
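A job script can catch this case early. The guard below is only a sketch: require_ckpt_dir is a hypothetical helper name, and the demo at the end runs it outside Grid Engine with the variable unset and then set to a scratch directory.

```shell
#!/bin/sh
# require_ckpt_dir: hypothetical guard for the top of a job script.
# It fails when SGE_CKPT_DIR is missing or is not a writable directory.
require_ckpt_dir() {
    if [ -z "${SGE_CKPT_DIR:-}" ]; then
        echo "SGE_CKPT_DIR is not set; was the job submitted with -ckpt?" >&2
        return 1
    fi
    if [ ! -d "$SGE_CKPT_DIR" ] || [ ! -w "$SGE_CKPT_DIR" ]; then
        echo "$SGE_CKPT_DIR is not a writable directory." >&2
        return 1
    fi
    return 0
}

# Demo outside Grid Engine: once with the variable unset, once with it set.
unset_status=0
( unset SGE_CKPT_DIR; require_ckpt_dir ) 2>/dev/null || unset_status=$?
set_status=0
( SGE_CKPT_DIR=$(mktemp -d); require_ckpt_dir ) || set_status=$?
echo "unset: $unset_status, set: $set_status"
```

In a real job script the call would typically be require_ckpt_dir || exit 1, placed before the trap line, with the job submitted as in qsub -q low -ckpt check_transparent check_transparent1.sh.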
>>>>>>>
>>>>>>>> thanks barry
>>>>>>>>
>>>>>>>> -----
>>>>>>>> Barry McInnes
>>>>>>>> 325 Broadway
>>>>>>>> Boulder CO 80304
>>>>>>>> (303)4976231
>>>>>>>> barry.j.mcinnes at noaa.gov
>>>>>>>> ---
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>>
>>>>
>>
>>
> 

-- 
---
Barry McInnes
325 Broadway
Boulder CO 80304
(303)4976231
barry.j.mcinnes at noaa.gov
---

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



