[GE users] Job puts entire cluster into Error state over misplaced pid file? Help!

Bevan C. Bennett bevan at fulcrummicro.com
Tue Sep 11 00:12:42 BST 2007




Ok, I think I've got a handle on how things are and are not working currently.

It turns out that one node started having memory errors that cause segfaults.
This (correctly) caused the job to fail when it was spawned onto that system.
The system does seem to be trying to put the pid file where it is supposed to
go (although at some point that location seems to have switched from the
central spool directory to the local node's $TMP directory).

Now I need to figure out why the pid directory never gets recreated, and why
jobs subsequently error out again when re-spawned onto new, healthy nodes.
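
For anyone chasing the same symptom, a quick way to see what state the pid
directory is in on a given node is a small check like the one below. This is
only a sketch, not anything Grid Engine ships: the function name
check_pid_dir and the choice of $TMP as the directory to inspect are my own
assumptions, so substitute whatever path your error message actually names.

```python
import os
import tempfile


def check_pid_dir(path):
    """Classify the state of a directory a pid file is meant to be written into.

    Hypothetical helper for illustration; not part of Grid Engine.
    """
    if not os.path.exists(path):
        return "missing"
    if not os.path.isdir(path):
        return "not a directory"
    # Creating a file inside a directory needs write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        return "not writable"
    return "ok"


if __name__ == "__main__":
    # Assumption: the pid file lands under the node-local $TMP.
    # Falls back to the platform default temp dir when $TMP is unset.
    candidate = os.environ.get("TMP", tempfile.gettempdir())
    print(candidate, "->", check_pid_dir(candidate))
```

Running it on both a healthy node and a node where jobs error out should at
least show whether the directory is missing outright or present but unusable.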




