[GE users] Qmaster stops starting jobs, nodes go "au"

Magnus Söderberg magnus.soderberg at switchcore.com
Mon May 17 10:37:27 BST 2004


Craig Tierney wrote:
> We are having a problem with our SGE server.  For the last several days, the server
> gets in a state where it is unable to start jobs and many of the nodes (20-40% of 800
> nodes) go into either "u" or "au" state.  Jobs get stuck in the "t" state as well.
....
> We think the problem might be due to load on the NFS server that SGE runs from.
> However, the load isn't that high, so we are not positive at this point.
> 
> Anyone seen anything like this?
Yup, we used to see that a lot.
We also had very slow responses from commands such as qstat and qsub, which often failed with
something like "Cannot send GDI message".
We're running SGE 5.3p5 on a mix of Linux and Solaris, with almost all disk NFS-mounted from a
NetApp.
The NetApp used to have a 100 Mbit/s Ethernet connection; once it was upgraded to a gigabit link,
the problems disappeared.
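
One rough way to check whether the shared filesystem is the slow part is to time a few small synchronous writes in a directory on the NFS mount and compare the result with a local disk. The following is only a minimal sketch under that assumption: the function name `time_small_writes` and its parameters are made up for illustration, it is not part of SGE, and it says nothing about what qmaster itself does on disk.

```python
import os
import sys
import tempfile
import time


def time_small_writes(directory, count=20, size=4096):
    """Return per-file latencies (seconds) for small fsync'ed writes in `directory`.

    Hypothetical helper for illustration only; not an SGE tool.
    """
    payload = b"x" * size
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        # Create, write, flush and delete one small temporary file.
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # force the data out to the server, not just the cache
        finally:
            os.close(fd)
            os.unlink(path)
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Pass the directory to test as the first argument; defaults to the
    # local temp directory, which is useful as a baseline.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    samples = time_small_writes(target)
    mean_ms = 1000 * sum(samples) / len(samples)
    max_ms = 1000 * max(samples)
    print(f"{target}: mean {mean_ms:.1f} ms, max {max_ms:.1f} ms")
```

Running it once against a local directory and once against a directory on the NFS mount gives two numbers to compare. A large gap, or a very high maximum, would point at the file server or its network link rather than at SGE.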

regards
-- 
Magnus Söderberg, M.Sc.E.E.
Staff Engineer
SwitchCore AB        Phone +46 46 2702560
Emdalavägen 18       Fax   +46 46 2702581
SE-223 69 Lund, Sweden
