[GE users] SGE job failure detection

remi remi.chaffard at consultant.volvo.com
Wed Jun 3 17:52:50 BST 2009


Thanks for this answer. 

I think your approach works well when the node stays up and the cause of a crash is known.
My problem is a little different, because we don't know what causes the freeze. That means that even if we add a cron watchdog, it will not detect anything wrong; then the node freezes and becomes unreachable (no more cron or any other service). In that case, the jobs running on that node crash and we see nothing.
This is our main problem: we don't know the causes of these freezes, so we cannot act before the node crashes.

Anyway, your solution gave me an idea. We can add a watchdog somewhere on one node (or two, for redundancy) that checks whether each node in the cluster is reachable. If a node is not, we can check whether jobs were running on it and launch them again (I still have to find a way to determine which jobs ran on a node after it froze; maybe qacct).
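A minimal sketch of that idea, assuming standard SGE tools (qconf, qmod, qacct, qresub); the one-day qacct window and the reachability test are assumptions, and qresub only works while qmaster still knows the job, so a site may need to fall back to qsub of the original script:

```shell
#!/bin/sh
# Hypothetical watchdog sketch: runs on a surviving node, checks every
# execution host, and resubmits jobs from hosts that stopped answering.
for host in $(qconf -sel); do
    if ! ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
        # Keep the scheduler away from the dead node.
        qmod -d "*@${host}"
        # qacct reads the accounting file, so it still works after a
        # freeze; -d 1 limits the search to the last day.
        jobs=$(qacct -h "$host" -d 1 2>/dev/null \
               | awk '/^jobnumber/ {print $2}' | sort -u)
        for j in $jobs; do
            # qresub clones a job qmaster still knows about; a job
            # already purged from qmaster would need a fresh qsub.
            qresub "$j"
        done
    fi
done
```

The qacct/awk step is the part Rémi was looking for: it recovers, after the fact, which jobs ran on the frozen host.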

Thanks a lot.

-----Original Message-----
From: craffi [mailto:dag at sonsorol.org] 
Sent: Wednesday 3 June 2009 18:32
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SGE job failure detection

For catching and acting on job failures you have found the right hook:
the epilog script. The problem, as you have noted, is whether your
epilog can reliably figure out if the failure was job or node related.
I often use the epilog for this, and there are certain classes of job
failure where it is easy to grep the output and resubmit the job
automatically when something known is seen (the cliché example: a
"license not found" error).
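That pattern can be sketched in a few lines; the error string is only an example, and the assumption is the documented SGE behaviour of exporting JOB_ID and SGE_STDOUT_PATH into the epilog's environment:

```shell
#!/bin/sh
# Hypothetical epilog sketch: resubmit only on a known, transient
# failure pattern; adapt the grep pattern to the failures you see.
# SGE exports JOB_ID and SGE_STDOUT_PATH to the epilog.
if grep -q "license not found" "$SGE_STDOUT_PATH" 2>/dev/null; then
    # Known transient failure: resubmit the same job.
    qresub "$JOB_ID"
fi
```

Anything that does not match a known pattern is left alone, which avoids endlessly resubmitting jobs that fail through user error.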

However, the best system I've seen on a very large cluster for
handling job loss due to node issues approaches the problem from a
different perspective. It is far easier and more straightforward to
have a cron script, or some other system watchdog or probe script,
running on every node in the cluster that proactively searches for
problems known to cause job issues; when one is found, it
automatically disables the queue instance on the node and sends an
alert message to the operators. By building an external system that
can rapidly find problems and disable the affected queue instance,
that site was able to massively reduce failures, to the point where
99% of job failures can now be traced to user error rather than
system problems.
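A per-node probe of this shape might look like the following sketch; probe_ok is a placeholder for site-specific checks, and the alert address is an assumption:

```shell
#!/bin/sh
# Hypothetical cron probe sketch: runs on every node; on failure it
# takes the node's queue instances out of scheduling and alerts the
# operators. probe_ok and operators@example.com are placeholders.
HOST=$(hostname -s)

probe_ok() {
    # Site-specific checks go here (NFS mounts, NICs, clock, ...).
    # Return non-zero on any detected problem.
    return 0
}

if ! probe_ok; then
    # Disable every queue instance on this node.
    qmod -d "*@${HOST}"
    echo "probe failed on ${HOST}" \
        | mail -s "SGE node ${HOST} disabled" operators@example.com
fi
```

Because the probe runs locally and disables the node itself, it does not depend on the node remaining reachable from outside once a problem starts, only on the problem being detectable before a full freeze.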

The probe looks for things like:
- missing NFS mounts
- missing IP addresses and unusual error counts on physical system NICs
- system clock sync errors (these can break MPI apps)
- permission errors
- UID/GID and PAM authentication issues
- memory, software RAID and S.M.A.R.T. errors
- etc.

This particular group keeps careful track of every problem that
results in a job failure, and each time they encounter a new one they
add a probe to their watchdog script that will identify it again in
the future. It works and is beautiful to see in action. Within a
minute or two of a problem occurring, the node has dropped out of the
batch scheduler and will do "no harm" until an operator can reimage
it or take a look.

My $.02


On Jun 3, 2009, at 12:21 PM, remi wrote:

> Hello,
> Very often some nodes hang in our Linux cluster, which can
> cause the jobs running on those nodes to crash. It seems to be a
> difficult problem to solve, and because of the poor reactivity of
> our hardware support, our customers want to know if there is a way
> to detect the failure of a job caused by a node and automatically
> launch the job again in that case.
> Detecting the crash of a job seems easy (we can check the
> return code in the epilog, for example), but detecting the cause of
> the crash is more difficult.
> We don't want to restart every job that crashes, because the main
> cause of a crash is user error.
> Has anybody thought about a way to do that?
> Thank you for your help
> Best Regards,
> _________________________________________________
> Société SOLUTEC
> On behalf of Volvo IT - GI&O / SP / AP / PDEV
> Tel: +33 4 72 96 61 52
> Email: mailto:remi.chaffard at consultant.volvo.com


To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


