[GE users] Job puts entire cluster into Error state over misplaced pid file? Help!

Bevan C. Bennett bevan at fulcrummicro.com
Wed Sep 12 02:15:23 BST 2007



Reuti wrote:
> On 11.09.2007 at 01:12, Bevan C. Bennett wrote:
> 
>> Ok, I think I've got a handle on how things are and are not working 
>> currently.
>>
>> It turns out that one node started having segfault-inducing memory 
>> errors. This (correctly) caused the job to fail when spawned onto that 
>> system. The system is, it seems, trying to put the pid file where it 
>> wants to go (although at some point this seems to have switched from 
>> being in the central spool directory to being in the local node's $TMP 
>> directory).
> 
> You mean: the node is repaired, has good memory again, and is still not 
> behaving correctly?

The node is currently out of the grid pending full diagnostics and repairs.
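On the pid file location: as far as I understand it, the directory the execd (and its shepherds) spool into, and hence where the per-job pid file ends up, is governed by execd_spool_dir in the cluster configuration. As a rough sketch on a 6.x install (node01 below is just a placeholder host name), something like:

    # global execd spool directory
    qconf -sconf | grep execd_spool_dir

    # per-host override, if any (node01 is a placeholder, not a real host here)
    qconf -sconf node01 | grep execd_spool_dir

should show whether a host has quietly picked up a local spool directory instead of the central one.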

My remaining question is: when the node was marked into the error state and the job 
was re-scheduled to a new, healthy node, why did it give the same error it gave on 
the broken node and put that healthy node into the error state as well?

After that, why did it go on to do the exact same thing to every other healthy 
node in the cluster?
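For anyone who hits the same cascade: as far as I understand it, when the shepherd cannot start a job (here, apparently because it could not write the pid file), the execd reports a startup failure, the queue instance is put into the error state, and the job is simply re-scheduled; since the underlying problem travels with the job rather than with any one node, the same failure then repeats on every node the job lands on. A rough sketch of commands to inspect and recover (GE 6.x syntax; the job id is just a placeholder):

    # show why each queue instance is in the error state
    qstat -f -explain E

    # check the accounting record of the offending job for its failure code
    qacct -j <jobid>

    # once the cause is fixed, clear the error state on all queue instances
    qmod -cq '*'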




