[GE users] SGE job failure detection
remi.chaffard at consultant.volvo.com
Wed Jun 3 17:21:16 BST 2009
Nodes in our Linux cluster hang quite often, which can crash the jobs running on them. The problem seems difficult to solve, and because hardware support is slow to respond, our customers want to know whether there is a way to detect that a job failed because of a node and automatically launch the job again in that case.
Detecting the crash of a job seems easy (we could perhaps check the return code in the epilog), but detecting the cause of the crash is more difficult.
We don't want to restart every job that crashes, because the main cause of crashes is user mistakes.
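To make the idea concrete, here is a rough sketch of the decision such an epilog could take. It is only an illustration under several assumptions: that the job's exit status can be read from key=value text containing an "exit_status=" entry, that an epilog exiting with status 99 asks Grid Engine to requeue the job, and that some site-specific node health check exists (represented here only by the hypothetical node_healthy flag). The function names are made up for this example.

```python
REQUEUE = 99  # assumption: an epilog exit status of 99 asks Grid Engine to requeue the job
OK = 0


def parse_exit_status(usage_text):
    """Return the integer after 'exit_status=' in key=value text, or None if absent."""
    for line in usage_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "exit_status":
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


def epilog_exit_code(job_exit_status, node_healthy):
    """Ask for a requeue only when the job failed AND the node looks faulty.

    node_healthy is a hypothetical flag: the result of whatever health check
    the site trusts (ping, disk or NFS test, hardware monitoring, ...).
    """
    if job_exit_status is None or job_exit_status == 0:
        return OK  # job succeeded, or its status is unknown: do nothing
    if node_healthy:
        return OK  # failure on a healthy node: most likely a user mistake
    return REQUEUE  # failure on a suspect node: rerun the job elsewhere
```

The point of separating the two conditions is that the return code alone only says that the job failed, while the node check is what distinguishes a hardware problem from a user mistake. The hard part, choosing a reliable health check, is left open here.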
Has anybody thought about a way to do that?
Thank you for your help
On behalf of Volvo IT - GI&O / SP / AP / PDEV
Tel: +33 4 72 96 61 52
Email: mailto:remi.chaffard at consultant.volvo.com