[GE users] sge_execd exits badly even killed cleanly
bernard.massot at ens.fr
Tue Jan 26 09:02:25 GMT 2010
Thank you very much for your help, Reuti. A few last comments.
On Tue, Jan 26, 2010 at 12:59:54AM +0100, reuti wrote:
> > * when your job receives a TERM signal it has 10 seconds to do a
> > checkpoint, then it will be restarted on another node with same
> > job id and $RESTARTED set to 1 ;
> But the TERM does not come from the shutdown of the sgeexecd, but
> from the system shutdown. Just look out for race conditions, but when
> it's working for you it's fine.
As we saw, even when asked to do so, sge_execd doesn't send a signal on
shutdown. Anyway, I'm going to remove the when=s flag from my checkpointing
> > * if you use "-notify" qsub option, your job will receive a USR2
> > signal if qdel is called, and then your job has $notify of time to
> > do a checkpoint ;
> Yes, you can also redefine the signal.
Thank you for reminding me. This feature is explained in the sge_conf man
page, but in neither qsub's nor queue_conf's. I think it should at
least be mentioned in the latter two.
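
For the archives, here is a rough sketch of how a job could react to the
signals discussed above. It assumes the job is submitted with "-notify" so
that it gets USR2 before being killed, that TERM arrives on system shutdown,
and that SGE sets $RESTARTED to 1 when the job is restarted. The function
names, the JSON checkpoint format and the checkpoint path are made up for
illustration; they are not part of SGE.

```python
import json
import os
import signal


def save_checkpoint(path, step):
    """Write the current step atomically (hypothetical checkpoint format)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)  # atomic rename, so a kill never leaves a half-written file


def load_checkpoint(path):
    """Return the saved step if the job was restarted, otherwise start at 0."""
    # Assumption: the scheduler exports RESTARTED=1 for a restarted job.
    if os.environ.get("RESTARTED") == "1" and os.path.exists(path):
        with open(path) as f:
            return json.load(f)["step"]
    return 0


def run(path, total_steps, work=lambda step: None):
    """Run work(step) for each step, checkpointing when TERM or USR2 arrives."""
    stop = {"requested": False}

    def request_stop(signum, frame):
        # Only set a flag here; the actual checkpoint is written by the main loop.
        stop["requested"] = True

    signal.signal(signal.SIGTERM, request_stop)  # e.g. system shutdown
    signal.signal(signal.SIGUSR2, request_stop)  # e.g. qdel on a job submitted with -notify

    step = load_checkpoint(path)
    while step < total_steps and not stop["requested"]:
        work(step)
        step += 1

    save_checkpoint(path, step)
    return step
```

A job script would call something like run("checkpoint.json", 1000, do_step),
where do_step is your own function. The handler only sets a flag so that the
checkpoint is written from the main loop between two steps, which has to fit
within the notify delay (or the 10 seconds after TERM).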
> > * don't mind the misleading mails sent by SGE.
I believe that, when starting, sge_execd should ask qmaster whether a
job is successfully running (or has successfully run) before sending a
mail saying it failed. What do you think about this behavior?