[GE users] sge_execd exits badly even killed cleanly

massot bernard.massot at ens.fr
Tue Jan 26 09:02:25 GMT 2010


Thank you very much for your help, Reuti. A few last comments.

On Tue, Jan 26, 2010 at 12:59:54AM +0100, reuti wrote:
> > * when your job receives a TERM signal it has 10 seconds to do a
> >   checkpoint, then it will be restarted on another node with same  
> >   job id and $RESTARTED set to 1 ;
> But the TERM does not come from the shutdown of the sgeexecd, but  
> from the system shutdown. Just look out for race conditions, but when  
> it's working for you it's fine.
As we saw, sge_execd doesn't send any signal on shutdown, even when
asked to do so. Anyway, I'm going to remove the when=s flag from my
checkpointing environment.
 
> > * if you use "-notify" qsub option, your job will receive a USR2  
> >   signal if qdel is called, and then your job has $notify of time to
> >   do a checkpoint ;
> Yes, you can also redefine the signal.
Thank you for reminding me. This feature is explained in the sge_conf
man page, but in neither qsub's nor queue_conf's. I think it should at
least be mentioned in the latter two.
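
For illustration, here is a minimal Python sketch of the job side of the
two points above: it traps TERM and USR2, stops at the next safe point,
writes a checkpoint, and resumes from it when $RESTARTED is 1. The
CKPT_FILE variable, the JSON state layout and the function names are
hypothetical and only serve the example; the "work" is a placeholder
counter. It is a sketch under those assumptions, not a tested SGE recipe.

```python
import json
import os
import signal
import tempfile

# Hypothetical checkpoint location (assumption, not an SGE variable);
# a real job would point this at a filesystem shared between nodes.
CHECKPOINT = os.environ.get("CKPT_FILE", "checkpoint.json")

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag, the main loop does the checkpoint."""
    global stop_requested
    stop_requested = True


def save_checkpoint(state, path=CHECKPOINT):
    """Write the state atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp, path)


def load_checkpoint(path=CHECKPOINT):
    """Resume only when the job was restarted and a checkpoint exists."""
    if os.environ.get("RESTARTED") == "1" and os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0}


def run(total_steps=1000, path=CHECKPOINT):
    # TERM arrives on system shutdown; USR2 arrives before the kill
    # when the job was submitted with "qsub -notify".
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGUSR2, request_stop)
    state = load_checkpoint(path)
    while state["step"] < total_steps and not stop_requested:
        state["step"] += 1  # placeholder for one unit of real work
    save_checkpoint(state, path)
    return state
```

A job script would simply call run() at its end. Keeping the handler down
to setting a flag means the checkpoint is always written from the main
loop, which must then reach a safe point within the notify delay.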

> > * don't mind the misleading mails sent by SGE.
I believe that, when starting, sge_execd should ask qmaster whether a
job is successfully running (or has successfully run) before sending a
mail saying it failed. What do you think about this behavior?
-- 
Bernard Massot

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241065
