[GE users] Nodes going 'au'

Reuti reuti at staff.uni-marburg.de
Fri Oct 29 15:34:32 BST 2004


Hi,

> I have recently begun having a problem with SGE on my cluster.  SGE
> version 5.3p3.  My nodes are going into the 'au' state pretty often and
> I'd like to try to fix that.
> 
> Beforehand, nodes would only do this when users were bad, created an OOM
> condition or something similar.  Now nodes drop into 'au' if there's a
> decent amount of load on the node, but nothing overbearing.  I can't get
> the node back successfully without rebooting it.
> 
> All nodes have a good network connection to the qmaster and the slave
> qmaster.  All nodes know each other by the same hostnames.  I have
> ignore_fqdn set to true as well.  If I restart the SGE daemons on the
> compute node in question without rebooting it, I receive the following
> log message about once a second:
> 
> Fri Oct 29 14:27:08 2004|execd|compute-1-11|W|can't receive request: READ ERROR
> 
> After a reboot it works fine, though.

Anything interesting in /var/log/messages on the nodes? - Reuti
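
A rough sketch of one way to look, run on the affected node: the function below greps a syslog file for kernel messages that often show up when a node stops responding. The function name scan_node_log and the pattern list are assumptions for illustration only, not taken from the report above; adjust the patterns to whatever your kernel actually logs.

```shell
# Hypothetical helper: print lines from a syslog file that hint at memory,
# disk, or NFS trouble. The pattern list is an assumption; extend as needed.
scan_node_log() {
    logfile="${1:-/var/log/messages}"
    grep -i -E 'out of memory|oom-killer|I/O error|not responding' "$logfile"
}
```

For example, `scan_node_log /var/log/messages | tail -n 20` would show the most recent matches around the time the node went into 'au'.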
