[GE users] sge_execd exits badly even killed cleanly

massot bernard.massot at ens.fr
Mon Jan 25 15:08:58 GMT 2010

On Mon, Jan 25, 2010 at 03:15:03PM +0100, reuti wrote:
> Then the application-level interface will provide more options. Did  
> you check the Howto:
> http://gridengine.sunsource.net/howto/checkpointing.html
> Inside the checkpoint.sh you will need to send something like
> `kill -usr1 -- -$job_pid` to send the signal to the complete process group.  
> The available variables are:
> char *ckpt_variables[] = {
>     "host",
>     "job_owner",
>     "job_id",
>     "job_name",
>     "queue",
>     "job_pid",
>     "ckpt_dir",
>     "ckpt_signal",
>     NULL
> };
> Did you find this anywhere documented? It would be another issue to  
> have these noted at a proper location. Some are mentioned in the  
> README.* in the $SGE_ROOT/ckpt though.
Well, I saw some of them in
http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf but they
definitely deserve to be in the sge_checkpoint manpage.
> To summarize: to generate different signals (and even a checkpoint  
> before the shutdown of the execd), I think the only working option is  
> to use the application-level interface and suspend the queue on this  
> machine before it is shut down. This way you have a chance that the  
> migrate script is called for sure, where you can send a different  
> signal to your application than the one sent from the checkpointing script.  
> The migrate script is not called at all for the shutdown of the execd  
Actually I feel my current system is working and doesn't really need all
this complicated configuration.
I'm going to teach my users what POSIX signals are and how to handle
them in a C program. Then I'll tell them that, if you use the "-ckpt
transp_usr1" option with qsub:
* every $min_cpu_interval your job receives a USR1 signal, which gives
  you some time to do a checkpoint, and you can set a longer time
  interval with the "-c" qsub option;
* when your job receives a TERM signal it has 10 seconds to do a
  checkpoint, then it will be restarted on another node with the same
  job id and $RESTARTED set to 1;
* if you use the "-notify" qsub option, your job will receive a USR2
  signal if qdel is called, and then your job has the $notify time to
  do a checkpoint;
* in case of violent hardware failure, your job gets started on another
  node, with the same job id and $RESTARTED set to 1;
* don't mind the misleading mails sent by SGE.
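
For the C side, here is a minimal sketch of what a job could do with the signals listed above. It only records which signal arrived and leaves the checkpoint itself to the main loop. The function and flag names (install_checkpoint_handlers, was_restarted, ckpt_requested, stop_requested) are hypothetical and chosen for illustration. Treating TERM and USR2 the same way ("checkpoint, then exit") is an assumption about what the application wants, not something SGE requires.

```c
/* Sketch only: all function and flag names below are hypothetical. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose sigaction() and friends */
#endif
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Flags written by the handler and read by the main loop.
 * volatile sig_atomic_t is the type that is safe to write from a handler. */
static volatile sig_atomic_t ckpt_requested = 0;  /* USR1: periodic checkpoint */
static volatile sig_atomic_t stop_requested = 0;  /* TERM or USR2: checkpoint, then exit */

/* The handler only sets a flag; the real work happens outside of it. */
static void on_signal(int signo)
{
    if (signo == SIGUSR1)
        ckpt_requested = 1;
    else if (signo == SIGTERM || signo == SIGUSR2)
        stop_requested = 1;
}

/* Install the handler for USR1, USR2 and TERM.
 * Returns 0 on success, -1 if sigaction() fails. */
static int install_checkpoint_handlers(void)
{
    const int signals[] = { SIGUSR1, SIGUSR2, SIGTERM };
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  /* restart interrupted system calls where possible */

    for (size_t i = 0; i < sizeof signals / sizeof signals[0]; i++)
        if (sigaction(signals[i], &sa, NULL) != 0)
            return -1;
    return 0;
}

/* Returns 1 if the environment says the job was restarted
 * (RESTARTED is "1"), 0 otherwise. */
static int was_restarted(void)
{
    const char *r = getenv("RESTARTED");
    return r != NULL && strcmp(r, "1") == 0;
}
```

A job would call install_checkpoint_handlers() once at startup, use was_restarted() to decide whether to reload its last checkpoint, and test the two flags at a safe point in each iteration of its main loop: write a checkpoint and clear ckpt_requested, or write a checkpoint and exit when stop_requested is set. Keeping the handler down to a flag assignment avoids calling functions that are not async-signal-safe from signal context.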

Do you think I'm missing something? My tests are rather convincing.
Bernard Massot

