[GE users] Nodes going 'au'

Jack Neely jjneely at pams.ncsu.edu
Fri Oct 29 15:29:33 BST 2004


Folks,

I have recently begun having a problem with SGE (version 5.3p3) on my
cluster.  My nodes are going into the 'au' state pretty often and I'd
like to try to fix that.

Beforehand, nodes would only do this when a user did something bad, such
as creating an OOM condition.  Now nodes drop into 'au' if there's a
decent amount of load on the node, but nothing overbearing.  I can't get
the node back successfully without rebooting it.

All nodes have a good network connection to the qmaster and the slave
qmaster.  All nodes know each other by the same hostnames.  I have
ignore_fqdn set to true as well.  If I restart the SGE daemons on the
compute node in question without rebooting it, I receive the following
log message about once a second:

Fri Oct 29 14:27:08 2004|execd|compute-1-11|W|can't receive request: READ ERROR

After a reboot it works fine, though.
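
To put a number on "about once a second", the warnings could be tallied
per minute with a small script.  This is only a sketch: it assumes each
entry sits on a single line in the actual log file (unwrapped by the mail
client) and that the five '|'-separated fields are timestamp, daemon,
host, level and message, which is inferred from the one line quoted
above.  The names LogEntry, parse_line and count_read_errors are made up
for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class LogEntry(NamedTuple):
    when: datetime
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'timestamp|daemon|host|level|message' line (assumed layout)."""
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, daemon, host, level, message = parts
    try:
        # Timestamp style taken from the quoted line: "Fri Oct 29 14:27:08 2004"
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None
    return LogEntry(when, daemon, host, level, message.strip())


def count_read_errors(lines: Iterable[str]) -> Counter:
    """Count 'READ ERROR' warnings per minute; unparseable lines are skipped."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry.level == "W" and "READ ERROR" in entry.message:
            counts[entry.when.replace(second=0)] += 1
    return counts
```

Feeding it the open execd log for the affected node, for example
count_read_errors(open(path)) with path pointing at wherever that log
lives on the install in question, gives a per-minute count that can be
compared before and after restarting the daemons.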

Can anyone shed some light on this?

Jack Neely
-- 
Jack Neely <slack at quackmaster.net>
Realm Linux Administration and Development
PAMS Computer Operations at NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89
