[GE users] Job puts entire cluster into Error state over misplaced pid file? Help!

Reuti reuti at staff.uni-marburg.de
Tue Sep 11 10:02:08 BST 2007


On 11.09.2007 at 01:12, Bevan C. Bennett wrote:

> Ok, I think I've got a handle on how things are and are not working  
> currently.
>
> It turns out that one node started having segfault-inducing memory  
> errors. This (correctly) caused the job to fail when spawned onto  
> that system. The system does, it seems, put the pid file where it
> wants it to go (although at some point this location seems to have
> switched from the central spool directory to the local node's $TMP
> directory).
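
Where sge_execd spools, and hence where it writes its pid file, is
controlled by the execd_spool_dir parameter in the global or host-local
cluster configuration. A minimal sketch of how one might check it,
assuming SGE 6.x, the default cell "default", and a hypothetical host
name node01:

    # Show the global execd spool directory setting
    qconf -sconf | grep execd_spool_dir

    # A host-local configuration, if one exists, overrides the global value
    qconf -sconf node01 | grep execd_spool_dir

    # Verify that the per-host spool directory exists and is writable
    # by the admin user (path assumes the default cell)
    ls -ld $SGE_ROOT/default/spool/node01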

You mean: the node is repaired, has good memory again, and is still
not behaving correctly?

-- Reuti

> Now I need to figure out why the pid directory never gets recreated
> and jobs subsequently re-error when re-spawned to new, healthy nodes.
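
Once the bad node is fixed, queue instances that went into Error state
stay that way until cleared by hand; SGE does not clear them
automatically. A hedged sketch of the usual recovery steps, assuming
SGE 6.x and a hypothetical queue instance all.q@node01:

    # Show which queue instances are in Error state and why
    qstat -f -explain E

    # Recreate the missing pid/spool directory by hand if necessary,
    # then clear the error state on the affected queue instance
    qmod -c all.q@node01

    # Confirm the E flag is gone
    qstat -f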
