[GE users] Spurious errors with checkpointing

Reuti reuti at staff.uni-marburg.de
Mon Dec 10 09:53:51 GMT 2007


Hi,

On 10.12.2007 at 10:09, Ruppert wrote:

> ...
>>
>> what checkpointing environment did you set up? Can you try to use
>> another signal, which is not used by SGE already for other tasks?
>
>
> The checkpointing environment is set up as follows:
>
> ckpt_name          test_migr
> interface          APPLICATION-LEVEL
> ckpt_command       NONE
> migr_command       kill -USR1 -$job_pid
> restart_command    NONE
> clean_command      NONE
> ckpt_dir           /tmp
> signal             NONE
> when               xsr
>
> Any recommendations what signal to use instead of USR1? I was not aware
> that USR1 is used by Gridengine internally.

Well, not really internally, but it's used as a warning before a
SIGSTOP if you request -notify (unless redefined in the SGE setup).
One additional option: instead of a single command line that is
executed with kill, specify a script which sends the SIGUSR1 and maybe
a real SIGKILL a few seconds/minutes later. If SIGUSR1 is caught by the
application, the job will never get killed otherwise. You could even
put some diagnostic echo messages in this script to check whether the
script is really executed too early.
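
A minimal sketch of such a script, in case it helps. The function name
migrate_job, the MIGR_LOG variable, the log path and the 60 second
default grace period are all made up for illustration; adjust them to
your setup. The target is handed to kill unchanged, so "-PID" addresses
the whole process group (as in your current migr_command) and "PID" a
single process.

```shell
#!/bin/sh
# Hypothetical wrapper for migr_command. All names, paths and the grace
# period are illustrative assumptions, not part of the original setup.

# migrate_job TARGET [GRACE_SECONDS]
#   TARGET is passed to kill as-is: "-PID" means the process group,
#   "PID" a single process.
migrate_job() {
    target=$1
    grace=${2:-60}
    log=${MIGR_LOG:-/tmp/migr_command.log}

    # Diagnostic line: records when (and whether) the script was executed.
    echo "$(date '+%Y-%m-%d %H:%M:%S') sending SIGUSR1 to $target" >> "$log"
    kill -s USR1 -- "$target" 2>/dev/null

    # Give the application time to write its checkpoint and exit.
    sleep "$grace"

    # Signal 0 sends nothing; it only tests whether the target still exists.
    if kill -s 0 -- "$target" 2>/dev/null; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') $target still present after ${grace}s, sending SIGKILL" >> "$log"
        kill -s KILL -- "$target" 2>/dev/null
    else
        echo "$(date '+%Y-%m-%d %H:%M:%S') $target is gone, no SIGKILL needed" >> "$log"
    fi
    return 0
}

# Run only when called with arguments, e.g. from the checkpointing environment.
if [ "$#" -gt 0 ]; then
    migrate_job "$@"
fi
```

Assuming it is saved as /path/to/migr_wrapper.sh (a placeholder path),
the checkpointing environment could then use something like
"migr_command /path/to/migr_wrapper.sh -$job_pid 60". The timestamps in
the log should show whether the migration command fires earlier than
expected.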

-- Reuti
