[GE users] sge_execd exits badly even when killed cleanly

massot bernard.massot at ens.fr
Fri Jan 22 16:54:48 GMT 2010


I have a problem with sge_execd's behavior when it is stopped. When
killed with the TERM signal it is supposed to exit cleanly, but that is
not the case on my execution hosts.
When I reboot or shut down a computer (which sends a TERM signal to
sge_execd), jobs don't receive the USR1 signal, even though I configured
a checkpoint environment with interface=transparent, signal=USR1 and
when=xsmr. Checkpointing works well in other contexts.
The most annoying side effect is that, even when a job is properly
rescheduled (my queues have rerun=true and the cluster configuration has
reschedule_unknown set) and is actually running correctly, I receive
a "Job failed" e-mail from the host that was rebooted. It says
"failed before writing exit_status:shepherd exited with exit status 19:
before writing exit_status".
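
To make the expected behavior concrete, here is a minimal sketch of a job
that checkpoints when it receives USR1, which is what a transparent
checkpoint environment with signal=USR1 relies on. It is an illustration
only, not the actual job script: the file name job.ckpt, the step counter
and the function names are all hypothetical.

```python
import os
import signal
import time

# Hypothetical checkpoint file; a real job would choose its own location.
CHECKPOINT_PATH = "job.ckpt"

# Set by the signal handler, consumed by the main loop.
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only set a flag; the main loop does the real work."""
    global checkpoint_requested
    checkpoint_requested = True


def write_checkpoint(path, step):
    """Save the current step atomically (write a temp file, then rename)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(step))
    os.replace(tmp, path)


def read_checkpoint(path):
    """Return the saved step, or 0 if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0


def run(total_steps, path=CHECKPOINT_PATH):
    """Resume from the last checkpoint and work up to total_steps."""
    global checkpoint_requested
    step = read_checkpoint(path)
    while step < total_steps:
        time.sleep(0.01)  # stand-in for one unit of real work
        step += 1
        if checkpoint_requested:
            write_checkpoint(path, step)
            checkpoint_requested = False
    return step


# Install the handler for USR1, the signal named in the checkpoint environment.
signal.signal(signal.SIGUSR1, request_checkpoint)
```

A job built this way would call run(...) as its main loop. If USR1 never
arrives before the host goes down, no checkpoint is written and a
rescheduled run starts again from step 0.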

I use GE 6.2 on Debian Lenny (Linux), with the standard packages made by
the Debian team.

Can you think of an explanation?
Bernard Massot

