[GE users] sge_execd exits badly even killed cleanly

reuti reuti at staff.uni-marburg.de
Tue Jan 26 15:38:13 GMT 2010

On 26.01.2010 at 10:02, massot wrote:

> Thank you very much for your help, Reuti. A few last comments.
> On Tue, Jan 26, 2010 at 12:59:54AM +0100, reuti wrote:
>>> * when your job receives a TERM signal it has 10 seconds to do a
>>>   checkpoint, then it will be restarted on another node with same
>>>   job id and $RESTARTED set to 1 ;
>> But the TERM does not come from the shutdown of the sgeexecd, but
>> from the system shutdown. Just look out for race conditions, but when
>> it's working for you it's fine.
> As we saw, even when asked to do so, sge_execd doesn't send a signal on
> shutdown. Anyway, I'm going to remove the when=s flag from my checkpointing
> environment.
>>> * if you use "-notify" qsub option, your job will receive a USR2
>>>   signal if qdel is called, and then your job has $notify of time to
>>>   do a checkpoint ;
>> Yes, you can also redefine the signal.
> Thank you for reminding me. This feature is explained in the sge_conf man
> page, but in neither qsub's nor queue_conf's. I think it should at least be
> mentioned in the latter two.

Yes, this would help. Could you please file an issue for it?

-- Reuti

>>> * don't mind the misleading mails sent by SGE.
> I believe that, when starting, sge_execd should ask qmaster whether a
> job is successfully running (or has successfully run) before sending a
> mail saying it failed. What do you think about this behavior?
> -- 
> Bernard Massot
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241065
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

