[GE users] sge_execd exits badly even killed cleanly

massot bernard.massot at ens.fr
Mon Jan 25 10:09:42 GMT 2010


On Fri, Jan 22, 2010 at 07:46:36PM +0100, reuti wrote:
> On 22.01.2010 at 17:54, massot wrote:
> > I have a problem with sge_execd's behavior when it gets stopped. When
> > killed with the TERM signal it's supposed to end cleanly, but that's
> > not the case on my execution hosts.
> > When I reboot or shut down a computer (hence send a TERM signal to
> > sge_execd),
> 
> When you reboot or shut down the machine it's too late, as all
> processes have already received the TERM/KILL signals (including
> network and the like; how do you transfer the checkpoint file to
> shared space?).
In my case it's not too late, as long as the job doesn't have too much
data to save. On Debian the relevant shutdown steps occur in the
following chronological order:
* send sge_execd the TERM signal;
* send remaining processes the TERM signal (jobs run by SGE are in this
  category);
* wait for about 10 seconds;
* send remaining processes the KILL signal;
* unmount NFS;
* shut down the network.
So if your job can save its data within 10 seconds, it's OK.
I thought sge_execd would send the USR1 signal to jobs when it receives
the TERM signal, and that jobs would then block the TERM signal, save
their data and stop cleanly.
Actually, even if it worked the expected way, I'd prefer not to use the
when=s flag in my checkpointing environment, because it wouldn't let a
job differentiate between periodic checkpoints and computer shutdown
(since SGE always sends the same signal). Without it, users can decide
whether they want to save data or not, based on the signal received.
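
A minimal Python sketch of how a job could tell the two cases apart,
assuming the checkpointing environment delivers USR1 for periodic
checkpoints and the shutdown sequence delivers TERM directly to the job.
The names save_state, saved and state are hypothetical placeholders, not
part of SGE.

```python
import signal
import sys

saved = []           # hypothetical log of (reason, step) pairs, for illustration
state = {"step": 0}  # hypothetical job state


def save_state(reason):
    # Placeholder for the job's real save routine (e.g. writing to shared space).
    saved.append((reason, state["step"]))


def on_usr1(signum, frame):
    # Periodic checkpoint: save and keep computing.
    save_state("periodic")


def on_term(signum, frame):
    # Host shutdown: save what fits in the grace period, then exit cleanly.
    save_state("shutdown")
    sys.exit(0)


def install_handlers():
    # Must be called from the main thread of the job.
    signal.signal(signal.SIGUSR1, on_usr1)
    signal.signal(signal.SIGTERM, on_term)
```

A job would call install_handlers() once at startup and then run its
normal work loop. Whether the TERM handler finishes in time depends on
how much data has to be written before the KILL signal arrives.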

> But there is an issue anyway (last entry, I just checked in 6.2u5
> again):
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2045
I agree with your report. Moreover, the e-mail sent by sge_execd is also
sent when sge_execd gets restarted, which can be a very long time after
it stopped.
It would be really awkward to receive an e-mail saying that your process
failed, a long time after it actually succeeded.
 
> When you use rerun and reschedule_unknown already, do you have any  
> need for the setup of a checkpointing environment?
I think rerun and reschedule_unknown are not used if your job doesn't
use a checkpointing environment with the when=r flag, are they?
Anyway, I need a checkpointing environment to have periodic backups.
It's easier than having programs use alarm() and SIGALRM.
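
For comparison, a rough sketch of what the alarm()/SIGALRM approach
looks like inside a program, written here in Python. The function name
start_periodic_checkpoints and the write_checkpoint callback are
hypothetical; the interval is whatever the user picks.

```python
import signal


def start_periodic_checkpoints(interval, write_checkpoint):
    """Call write_checkpoint() every `interval` seconds using SIGALRM."""

    def on_alarm(signum, frame):
        write_checkpoint()
        signal.alarm(interval)  # alarm() is one-shot, so re-arm it

    signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(interval)  # interval must be a whole number of seconds
    return on_alarm
```

Every program has to carry this logic itself and pick its own interval,
which is the part a checkpointing environment takes off the user's
hands.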
-- 
Bernard Massot

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240852
