[GE users] how to checkpoint cadence simulator spectre

Jan Sundermeyer jan.sundermeyer at iis.fraunhofer.de
Tue Nov 6 07:35:47 GMT 2007


Reuti wrote:
> Hi,
> 
> On 05.11.2007 at 16:27, Jan Sundermeyer wrote:
> 
>> Hello,
>>
>> we have recently installed SGE and want to use it for Cadence Spectre
>> simulations.
>> Right now I am trying to set up checkpointing for the simulator.
>>
>> Theoretically this should be quite simple:
>>
>> Spectre writes a checkpoint on receipt of SIGUSR2 and then terminates
>> itself.
>> If it is rerun with the option "+recover" and a checkpoint file is
>> present, it continues the simulation from the saved point.
>>
>> However, I have failed to get it to work with SGE 6.1u2.
>>
>> 1) Checkpointing via the transparent mode does not work.
>> If I have a checkpoint generated on suspend, the process gets
>> killed.
>> If I let it checkpoint on reschedule, no checkpoint is written, but it
>> tries to skip the first simulation steps on rerun, which is rather
>> unexpected behaviour.
> 
> You can find some nice state diagrams in this document:
> 
> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
> 
>> 2) Checkpointing via the application-level mode does not work (at least if I
>> want to checkpoint on suspend), as the process gets suspended
>> before it can receive SIGUSR2, so no checkpoint is written.
> 
> Yes, the checkpoint has to be created before the job gets suspended.
> This can be done in the migration procedure, however. The following page
> might explain some details:
> 
> http://gridengine.sunsource.net/howto/checkpointing.html
> 
> Just to note: the signal will be sent to the complete process group, so
> some trapping in the shell script might be necessary. Otherwise the job
> appears to be killed just by receiving a USR2 signal.
> 
> -- Reuti
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

The hint about the process group and the method described in
APSTC-TB-2004-005.pdf solved the problem.
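
For readers following along, the trapping that Reuti mentions could look roughly like the job script below. This is only a sketch under assumptions, not the script actually used here: the netlist name input.scs is hypothetical, the checkpoint signal is assumed to be USR2, and it relies on SGE setting RESTARTED=1 when a checkpointing job is restarted. The wrapper catches the signal with a handler rather than ignoring it, because an ignored signal would be inherited by spectre.

```shell
#!/bin/sh
# Sketch of a job script for a checkpointable Spectre run.
# Assumptions: checkpoint signal is USR2, netlist is input.scs (hypothetical).

# The signal reaches the whole process group. Catch it in the wrapper so the
# shell survives; a handler (unlike an ignored signal) is not inherited by
# child processes, so spectre still handles USR2 itself.
trap 'echo "job script: got USR2, waiting for spectre" >&2' USR2

# Print "+recover" only when SGE reports a restart (RESTARTED=1).
recover_flag() {
    if [ "${RESTARTED:-0}" = "1" ]; then
        echo "+recover"
    fi
}

NETLIST=${NETLIST:-input.scs}

# Run the simulator only where it is installed.
if command -v spectre >/dev/null 2>&1; then
    # recover_flag is left unquoted so an empty result adds no argument.
    spectre $(recover_flag) "$NETLIST"
else
    echo "spectre not found; would run: spectre $(recover_flag) $NETLIST" >&2
fi
```

On a first run the script starts spectre without extra options; after a migration it adds "+recover" so the simulation continues from the checkpoint file.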

Right now I use the application-level interface for checkpointing, using the
following script for checkpointing and migration.
The script is called with $job_pid as its argument.

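#!/bin/sh
#
# spectre_migrate.sh
#
# $1 is the job PID. Take the PID in the last "(...)" of each pstree line,
# i.e. the leaf process(es) below the job, and signal only those.
# $cpid stays unquoted on purpose: it may hold more than one PID.
cpid=`pstree -p "$1" | awk -F "(" '{ print $NF }' | awk -F ")" '{ print $1 }'`
kill -s INT $cpid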

 Jan
