[GE users] SGE job failure detection

craffi dag at sonsorol.org
Wed Jun 3 17:31:57 BST 2009



For catching and acting on job failures you have found the right hook
-- the epilog script. The problem, as you have noted, is whether your
epilog can reliably figure out if the failure was job related or node
related. I often use the epilog for this, and there are certain
classes of job failures where it is easy to grep the output and
resubmit the job automatically when something known is seen (classic
example: a license-not-found error).
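
As a rough sketch of that idea (not a description of my exact setup),
an epilog along these lines greps the job's output for a known,
recoverable failure and asks Grid Engine to reschedule the job by
exiting with status 99. The license string is a placeholder, and the
use of $SGE_STDOUT_PATH assumes it is set in the epilog environment;
depending on your queue configuration you might instead pass
$stdout_path to the epilog as an argument.

  #!/bin/sh
  # Hypothetical epilog sketch: reschedule jobs that failed for a known,
  # node-independent, recoverable reason (here: a license checkout failure).
  OUT="${SGE_STDOUT_PATH:-$1}"

  if [ -n "$OUT" ] && [ -r "$OUT" ]; then
      if grep -qi "license not found" "$OUT"; then
          # Exit status 99 asks sge_execd to reschedule the job instead
          # of treating it as finished.
          exit 99
      fi
  fi

  exit 0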

However, the best system I have seen on a very large cluster for
handling job loss due to node issues approaches the problem from a
different perspective. It is far easier and more straightforward to
have a cron job or some other watchdog/probe script running on every
node in the cluster that proactively searches for problems known to
cause job failures and, when one is found, automatically disables the
queue instance on that node and sends an alert to the operators. By
building an external system that could rapidly find problems and
disable the affected queue instance, that group massively reduced
failures, to the point where 99% of job failures can now be traced to
user error rather than system problems.

The probe looks for things like the following (a rough sketch of such
a probe appears after the list):
- Missing NFS mounts
- Missing IP addresses and unusual error counts on physical NICs
- System clock sync errors (which can break MPI apps)
- Permission errors
- UID/GID and PAM authentication issues
- Memory, software RAID and S.M.A.R.T. errors
- etc. etc. etc.
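
A minimal sketch of that kind of watchdog, under site-specific
assumptions (the mount point, queue name and alert address below are
placeholders, not details from the cluster described above):

  #!/bin/sh
  # Hypothetical node watchdog sketch, run from cron every minute or two.
  # If a known problem is found, disable this node's queue instance so no
  # new jobs land here, and alert the operators. The account running this
  # needs SGE operator/manager rights to use qmod.
  NODE=$(hostname -s)
  QUEUE="all.q"                  # placeholder queue name
  ALERT="hpc-ops@example.com"    # placeholder alert address
  PROBLEM=""

  # Check 1: a required NFS mount is missing.
  mountpoint -q /home || PROBLEM="missing /home NFS mount"

  # Check 2: system clock is not synchronised (can break MPI apps).
  if [ -z "$PROBLEM" ] && command -v ntpstat >/dev/null 2>&1 && ! ntpstat >/dev/null 2>&1; then
      PROBLEM="clock not synchronised"
  fi

  if [ -n "$PROBLEM" ]; then
      # Disable the queue instance on this host only; running jobs keep
      # going, but the scheduler stops placing new jobs here.
      qmod -d "${QUEUE}@${NODE}"
      echo "watchdog on ${NODE}: ${PROBLEM}" | mail -s "node ${NODE} disabled" "$ALERT"
  fi

Something like that would run from each node's crontab; once the node
has been fixed or reimaged, an operator re-enables the queue instance
with "qmod -e".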

This particular group keeps careful track of every problem that
results in a job failure, and each time they encounter a new one they
add a probe to their watchdog script so it will be caught the next
time it occurs. It works, and it is beautiful to see in action. Within
a minute or two of a problem occurring, the node has dropped out of
the batch scheduler and will do "no harm" until an operator can look
at it or reimage it.

My $.02

-Chris



On Jun 3, 2009, at 12:21 PM, remi wrote:

> Hello,
>
> Very often some nodes in our Linux cluster hang, which can cause
> the crash of jobs running on that node. It seems to be a difficult
> problem to solve, and because of the slow reactivity of hardware
> support, our customers want to know whether there is a way to detect
> the failure of a job caused by a node and automatically relaunch the
> job in that case.
>
> Detecting the crash of a job seems easy -- we could check the return
> code in the epilog, for example -- but detecting the cause of the
> crash is more difficult.
> We don't want to restart every job that crashes, because the main
> reason for a crash is a user mistake.
>
> Has anybody thought about a way to do that?
> Thank you for your help
>
> Best Regards,
>
> _________________________________________________
> Rémi CHAFFARD
> Société SOLUTEC
> On behalf of Volvo IT - GI&O / SP / AP / PDEV
> Tel: +33 4 72 96 61 52
> Email: mailto:remi.chaffard at consultant.volvo.com
>
