[GE users] Spurious errors with checkpointing

Ruppert dieter_ruppert at siemens.com
Mon Dec 10 11:00:01 GMT 2007


Hi,

...
>> ...
>>>
>>> What checkpointing environment did you set up? Can you try to use
>>> another signal, one that SGE does not already use for other tasks?
>>
>>
>> The checkpointing environment is set up as follows:
>>
>> ckpt_name          test_migr
>> interface          APPLICATION-LEVEL
>> ckpt_command       NONE
>> migr_command       kill -USR1 -$job_pid
>> restart_command    NONE
>> clean_command      NONE
>> ckpt_dir           /tmp
>> signal             NONE
>> when               xsr
>>
>> Any recommendations what signal to use instead of USR1? I was not aware
>> that USR1 is used by Gridengine internally.
>
>Well, not really internally, but it is used as a warning before a
>SIGSTOP if you request -notify (unless redefined in the SGE setup).
>One additional option: instead of a single kill command line, specify
>a script that sends SIGUSR1 and maybe a real SIGKILL a few
>seconds/minutes later. If SIGUSR1 is caught by the application, the
>job will never get killed otherwise. You could even put some
>diagnostic echo messages in this script to check whether the script
>is really executed too early.
>
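
The script suggested above might look roughly like the following. This
is only a minimal sketch: the script name migrate_job.sh, the function
name migrate_job, the 60 second default grace period and the wording of
the diagnostic messages are all assumptions, not part of any SGE setup.
It sends USR1 to the process group, polls until the group is gone, and
falls back to KILL if the grace period runs out.

```shell
#!/bin/sh
# migrate_job.sh (hypothetical name): send USR1 to a job's process group
# and escalate to KILL if the group is still alive after a grace period.
# Usage: migrate_job.sh <job_pid> [grace_seconds]

migrate_job() {
    pgid=$1
    grace=${2:-60}    # assumed default; tune to the application's cleanup time

    # Diagnostic message with a timestamp, to see when SGE really calls us.
    echo "$(date): sending USR1 to process group $pgid" >&2
    if ! kill -USR1 "-$pgid" 2>/dev/null; then
        echo "$(date): process group $pgid not found, nothing to do" >&2
        return 0
    fi

    # Poll once per second until the group is gone or the grace period ends.
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "-$pgid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done

    echo "$(date): group $pgid still alive after ${grace}s, sending KILL" >&2
    kill -KILL "-$pgid" 2>/dev/null
    return 0
}

# Run only when called with arguments, e.g. from migr_command.
if [ "$#" -gt 0 ]; then
    migrate_job "$@"
fi
```

Assuming the script is installed at some path on the execution hosts
(the path below is a placeholder), the checkpointing environment would
then reference it instead of the bare kill:

migr_command       /path/to/migrate_job.sh $job_pid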

We do not use -notify (job submission is done via wrapper scripts
exclusively, and these do not set -notify), so I assume that USR1
is safe to use. One of the processes which constitute a job intercepts
the signal and does a cleanup and a clean termination of all processes
involved. The signal therefore goes to the whole process group.
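
For illustration only, the intercepting side could be sketched as a
shell wrapper like the one below. This is not our actual application;
the names on_usr1, run_job, worker_pid and got_usr1 are made up, and
the cleanup is reduced to stopping one worker. Note that when the
signal goes to the whole process group, the worker receives USR1 as
well, so it has to ignore or handle it itself.

```shell
#!/bin/sh
# Hypothetical job wrapper: catch USR1, stop the worker, finish cleanly.
# All names and the cleanup steps are assumptions for illustration.

worker_pid=""
got_usr1=0

# Handler: record the signal, terminate the worker and reap it.
on_usr1() {
    got_usr1=1
    echo "caught USR1: cleaning up" >&2
    if [ -n "$worker_pid" ]; then
        kill -TERM "$worker_pid" 2>/dev/null
        wait "$worker_pid" 2>/dev/null
    fi
}
trap on_usr1 USR1

# Run a command in the background and wait for it. Waiting with the
# "wait" builtin lets the shell run the USR1 trap as soon as the
# signal arrives instead of after the command has finished.
run_job() {
    "$@" &
    worker_pid=$!
    wait "$worker_pid"
}
```

A job script would then call something like "run_job ./solver input"
(a placeholder command) and check got_usr1 afterwards to decide how
to exit.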


