[GE users] sge_execd exits badly even killed cleanly

reuti reuti at staff.uni-marburg.de
Mon Jan 25 23:59:54 GMT 2010


On 25.01.2010 at 16:08, massot wrote:

> On Mon, Jan 25, 2010 at 03:15:03PM +0100, reuti wrote:
>> Then the application-level interface will provide more options. Did
>> you check the Howto:
>>
>> http://gridengine.sunsource.net/howto/checkpointing.html
>>
>> Inside the checkpoint.sh you will need to send something like
>> `kill -usr1 -- -$job_pid` to send the signal to the complete process group.
>> The available variables are:
>>
>> char *ckpt_variables[] = {
>>     "host",
>>     "job_owner",
>>     "job_id",
>>     "job_name",
>>     "queue",
>>     "job_pid",
>>     "ckpt_dir",
>>     "ckpt_signal",
>>     NULL
>> };
>>
>> Did you find this documented anywhere? It would be another issue to
>> have these noted at a proper location. Some are mentioned in the
>> README.* in the $SGE_ROOT/ckpt though.
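
For reference, the same process-group addressing is available from C: kill(2) with a negative pid signals every process in that group. The sketch below assumes the job's process group id equals $job_pid, as in the shell one-liner above. The function name signal_job_group is made up for illustration, and the guard against group ids <= 1 is an added safety assumption, not something SGE requires.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: send signo to every process in group pgid,
 * the C counterpart of `kill -usr1 -- -$job_pid`.
 * Returns 0 on success, -1 on error with errno set. */
int signal_job_group(pid_t pgid, int signo)
{
    /* kill(0, ...) would signal our own group and kill(-1, ...) nearly
     * every process we are allowed to signal, so refuse both
     * (added guard, an assumption of this sketch). */
    if (pgid <= 1) {
        errno = EINVAL;
        return -1;
    }
    return kill(-pgid, signo);
}
```
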
> Well, I saw some of them in
> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf but they
> definitely deserve to be in sge_checkpoint manpage.
>
>> To summarize: to generate different signals (and even a checkpoint
>> before the shutdown of the execd), I think the only working option is
>> to use the application-level interface and suspend the queue on this
>> machine before it is shut down. This way you have a chance that the
>> migrate script is called for sure, where you can send a different
>> signal to your application than the one sent by the checkpointing script.
>> The migrate script is not called at all for the shutdown of the execd.
> Actually I feel my current system is working and doesn't really need
> all this complicated configuration.
> I'm going to teach my users what posix signals are and how to handle
> them in a C program. Then I'll tell them that, if you use the "-ckpt
> transp_usr1" option with qsub:
> * every $min_cpu_interval your job receives a USR1 signal, which gives
>   you some time to do a checkpoint, and you can use a bigger time
>   interval using the "-c" qsub option;

Yes.
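
A minimal sketch of the USR1 side in C, assuming the job polls a flag in its main loop. The names install_usr1_handler and checkpoint_requested are illustrative only, and writing the actual checkpoint is left to the application.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, polled by the main loop. */
static volatile sig_atomic_t ckpt_flag = 0;

static void on_usr1(int signo)
{
    (void)signo;
    ckpt_flag = 1;   /* only async-signal-safe work in here */
}

/* Install the SIGUSR1 handler; returns 0 on success, -1 on error. */
int install_usr1_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Returns 1 if a checkpoint was requested since the last call,
 * clearing the request; returns 0 otherwise. */
int checkpoint_requested(void)
{
    if (ckpt_flag) {
        ckpt_flag = 0;
        return 1;
    }
    return 0;
}
```

The main loop would call checkpoint_requested() between work steps and write its state when it returns 1. Doing the write inside the handler itself is not async-signal-safe.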


> * when your job receives a TERM signal it has 10 seconds to do a
>   checkpoint, then it will be restarted on another node with the same
>   job id and $RESTARTED set to 1;

But the TERM does not come from the shutdown of the sgeexecd, but
from the system shutdown. Just look out for race conditions, but if
it's working for you, it's fine.


> * if you use the "-notify" qsub option, your job will receive a USR2
>   signal if qdel is called, and then your job has $notify of time to
>   do a checkpoint;

Yes, you can also redefine the signal.
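
A sketch of how the TERM and USR2 cases could be caught in the same C program, assuming the notify signal is left at USR2 (if it is redefined, the signal number below has to change accordingly). The names install_shutdown_handlers and shutdown_requested are illustrative only.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Last shutdown-type signal seen; 0 means none yet. */
static volatile sig_atomic_t shutdown_signo = 0;

static void on_shutdown(int signo)
{
    shutdown_signo = signo;   /* just record it; the main loop reacts */
}

/* Install one handler for SIGTERM and SIGUSR2.
 * Returns 0 on success, -1 if either sigaction() call fails. */
int install_shutdown_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_shutdown;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTERM, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Returns SIGTERM or SIGUSR2 once a final checkpoint is due, else 0. */
int shutdown_requested(void)
{
    return (int)shutdown_signo;
}
```

When shutdown_requested() returns non-zero, the main loop should write its last checkpoint and exit promptly, since the time left is limited in both cases.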


> * in case of violent hardware failure, your job gets started on
>   another node, with the same job id and $RESTARTED set to 1;

Yes.
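
On the restarted side, the program only needs to look at $RESTARTED to decide whether to resume from its last checkpoint. A small sketch, assuming that only the exact value "1" means "restarted"; both function names are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Interpret a RESTARTED value: only the string "1" counts as restarted
 * (an assumption of this sketch). */
int restarted_value_is_set(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Returns 1 if this run is a restart, judged by $RESTARTED, else 0. */
int job_was_restarted(void)
{
    return restarted_value_is_set(getenv("RESTARTED"));
}
```

A job would call job_was_restarted() once at startup and load its checkpoint file only when it returns 1.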


> * don't mind the misleading mails sent by SGE.

-- Reuti


> Do you think I'm missing something? My tests are rather convincing.
> -- 
> Bernard Massot
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240903
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240984

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].