[GE users] sge_execd exits badly even killed cleanly

reuti reuti at staff.uni-marburg.de
Fri Jan 22 18:46:36 GMT 2010


Hi,

On 22.01.2010 at 17:54, massot wrote:

> I have a problem with sge_execd's behavior when it gets stopped. When
> killed with a TERM signal it's supposed to exit cleanly, but that's
> not the case on my execution hosts.
> When I reboot or shut down a computer (hence send a TERM signal to
> sge_execd),

when you reboot or shut down the machine it's too late, as all
processes have already received the TERM/KILL signals (including the
network and the like - how would you transfer the checkpoint file to
shared space?). But there is an issue anyway (see the last entry, I
just checked with 6.2u5 again):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=2045

As you use the transparent interface, the creation of the checkpoint
has to be initiated beforehand. But even with the application-level
interface, it's too late to create and transfer the checkpoint
information once the machine is already rebooting or shutting down.
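
To illustrate what "application-level" means here: the job itself has
to notice the request and write its own state to shared space while
the machine is still fully up. Below is a minimal sketch in Python,
not a tested Grid Engine recipe. The function names, the JSON state
and the checkpoint path are all hypothetical, and USR1 merely mirrors
the signal= setting from the original mail; whether and when the
signal actually reaches the job depends on the checkpoint environment.

```python
import json
import os
import signal
import tempfile

# Set by the signal handler, consumed by the main loop.
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only set a flag; the main loop does the real work."""
    global checkpoint_requested
    checkpoint_requested = True


def write_checkpoint(path, state):
    """Write to a temp file, then rename, so a kill mid-write keeps the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"step": 0}


def run(path, total_steps):
    """Resume from the checkpoint and save state whenever USR1 has arrived."""
    global checkpoint_requested
    signal.signal(signal.SIGUSR1, request_checkpoint)
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # placeholder for one unit of real work
        if checkpoint_requested:
            write_checkpoint(path, state)
            checkpoint_requested = False
    write_checkpoint(path, state)  # final state, so a rerun has nothing left to do
    return state
```

A rescheduled run would call run() with the same path on shared
storage and continue from the last saved step. None of this helps once
the shutdown is already in progress, for the reasons given above.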


> jobs don't receive the USR1 signal, although I configured a
> checkpoint environment with interface=transparent, signal=USR1 and
> when=xsmr. Checkpointing works well in other contexts.
> The most annoying side effect of this is that, even when a job is
> properly rescheduled (my queues have rerun=true and the cluster's
> conf has reschedule_unknown set) and is actually running correctly,
> I receive a "Job failed" e-mail from the host that was rebooted. It
> says "failed before writing exit_status:shepherd exited with exit
> status 19: before writing exit_status".

Yes, it's a confirmation that something happened to the original run.
Since you already use rerun and reschedule_unknown, do you need a
checkpointing environment at all?

-- Reuti


>
> I use GE 6.2 on Linux Debian Lenny (using standard packages made by
> the Debian team).
>
> Can you think of an explanation?
> -- 
> Bernard Massot
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=240439
