[GE users] SGE job failure detection

fx d.love at liverpool.ac.uk
Thu Jun 4 11:52:52 BST 2009

remi <remi.chaffard at consultant.volvo.com> writes:

> Thanks for this answer. 
> I think your approach is very good if the node is up and if we know
> the cause of some crash.

Also, in my experience it can be rather difficult to decide automatically
what to do with your diagnostics, i.e. whether to disable the node, kill
the job, &c.  The general idea is spot on, though.

> My problem is a little bit different, because we don't know the cause
> of the freeze. It means we can add a cron watchdog, but it will not
> detect anything wrong; then the node freezes, so the node is unreachable
> (no more cron or other services).

That's the cause of many/most of our problems too.  The thing to do is
to use Nagios, or similar.  It can take arbitrary action on the
management node -- the head in our case -- not just raise an alert.
(You may have to use, say, sudo, to do it if nagios isn't running as

I've done little to automate recovery, though, so I can't help much with
code.  (I do at least aim to write a new check when I see a new failure
mode that wasn't already detected and seems worth the effort to check.)
By the way, regular checks running on the compute nodes could introduce
significant noise in the system, which is known to be relevant for MPI
performance.  If you do run a number of them, it may be worth using
something like CFengine or, probably better, Puppet.
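
A minimal sketch of the head-node approach described above, in the shape
of a Nagios-style check that also takes action.  It runs on the
management node, so it adds no load to the compute nodes.  Assumptions
not taken from this thread: the function names are made up for
illustration, the ping flags are those of Linux iputils, and disabling
all queue instances on the host with qmod -d is only one possible
reaction.  Nagios itself would normally keep the check and the action
(an event handler) separate; they are combined here for brevity.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def is_reachable(host, timeout_s=5):
    """Return True if a single ping to host succeeds.

    Assumes Linux iputils ping, where -W is the reply timeout in seconds.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def disable_command(host, use_sudo=True):
    """Build the command that disables every queue instance on host."""
    cmd = ["qmod", "-d", "*@" + host]
    # sudo is needed when the monitoring user lacks SGE manager/operator rights.
    return ["sudo"] + cmd if use_sudo else cmd


def check_node(host, probe=is_reachable, run=subprocess.call):
    """Probe host; if it is unreachable, disable its queues.

    Returns a Nagios status code.  probe and run are injectable so the
    logic can be exercised without a cluster.
    """
    if probe(host):
        print("OK - %s is reachable" % host)
        return OK
    run(disable_command(host))
    print("CRITICAL - %s unreachable; queue instances disabled" % host)
    return CRITICAL
```

A wrapper script would pass the status back to the monitoring system
with sys.exit(check_node(hostname)).  Whether disabling the queues is
the right reaction depends on the failure mode, as noted above.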

> If not we can check if there were jobs running on that node and launch
> them again (I have to find a solution to check which jobs ran on a node
> after the node freezes, maybe qacct).

Use

  qhost -j -h ...

while the node is still uncontactable.
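
As a rough sketch, the job IDs could be pulled out of that listing with a
few lines of Python.  The assumption here is that each job line starts
with a numeric job ID while host and header lines do not; the SAMPLE text
is a made-up approximation of the layout, not real output, so check it
against what your own installation prints.  The function names are
hypothetical.

```python
import subprocess

# Made-up sample approximating the layout of "qhost -j -h <host>" output.
SAMPLE = """\
HOSTNAME   ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-----------------------------------------------------------------
global     -              -     -       -       -       -       -
node01     lx24-amd64     8     -   15.7G       -    2.0G       -
   job-ID  prior   name    user   state submit/start at     queue         master ja-task-ID
   ----------------------------------------------------------------------------------------
     4711 0.55500 sim_a   alice  r     06/04/2009 09:15:02 all.q@node01  MASTER
     4712 0.55500 sim_b   alice  r     06/04/2009 09:20:41 all.q@node01  MASTER 1
     4712 0.55500 sim_b   alice  r     06/04/2009 09:20:41 all.q@node01  MASTER 2
"""


def parse_job_ids(qhost_output):
    """Return the distinct job IDs found in the listing, in order of appearance."""
    job_ids = []
    for line in qhost_output.splitlines():
        fields = line.split()
        # Job lines are assumed to begin with an all-digit job ID;
        # array tasks repeat the same ID, so duplicates are skipped.
        if fields and fields[0].isdigit() and fields[0] not in job_ids:
            job_ids.append(fields[0])
    return job_ids


def jobs_on_host(host):
    """Run qhost for one host on the head node and return its job IDs."""
    output = subprocess.check_output(
        ["qhost", "-j", "-h", host], universal_newlines=True
    )
    return parse_job_ids(output)
```

With the sample above, parse_job_ids(SAMPLE) returns ['4711', '4712'].
The resulting list is what a resubmission step would work from.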

