[GE users] Qmaster stops starting jobs, nodes go "au"

Craig Tierney ctierney at hpti.com
Thu May 13 18:43:21 BST 2004


We are having a problem with our SGE server.  For the last
several days, the server has been getting into a state where it is
unable to start jobs and many of the nodes (20-40% of 800 nodes)
go into either the "u" or "au" state.  Jobs get stuck in the "t"
state as well.

We are running SGE 5.3p1.

Messages on the qmaster show that jobs are not being
delivered to the nodes:

Thu May 13 17:34:40 2004|qmaster|g0255|W|failed to deliver job 2178875.1 to queue "g0163.q"
Thu May 13 17:34:41 2004|qmaster|g0255|W|failed to deliver job 2178910.1 to queue "g0422.q"
Thu May 13 17:34:41 2004|qmaster|g0255|W|failed to deliver job 2178903.1 to queue "g0379.q"
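
To see whether the failures cluster on particular queues, the warnings can be tallied per queue. The following is only a minimal sketch, assuming the pipe-delimited layout of the excerpt above (timestamp|daemon|host|level|message) with each entry on a single line; the function name and the sample list are illustrative and not part of SGE.

```python
import re
from collections import Counter

# Pattern inferred from the excerpt above; treat it as an assumption
# about the qmaster messages format, not a documented specification.
FAILED_DELIVERY = re.compile(
    r'^(?P<ts>[^|]+)\|qmaster\|(?P<host>[^|]+)\|W\|'
    r'failed to deliver job (?P<job>\d+\.\d+)\s+to queue "(?P<queue>[^"]+)"'
)


def count_failed_deliveries(lines):
    """Return a Counter mapping queue name -> number of failed deliveries."""
    counts = Counter()
    for line in lines:
        match = FAILED_DELIVERY.match(line.strip())
        if match:
            counts[match.group("queue")] += 1
    return counts


# The three warnings quoted above, each joined onto one line.
sample = [
    'Thu May 13 17:34:40 2004|qmaster|g0255|W|failed to deliver job 2178875.1 to queue "g0163.q"',
    'Thu May 13 17:34:41 2004|qmaster|g0255|W|failed to deliver job 2178910.1 to queue "g0422.q"',
    'Thu May 13 17:34:41 2004|qmaster|g0255|W|failed to deliver job 2178903.1 to queue "g0379.q"',
]

failed = count_failed_deliveries(sample)
for queue, n in sorted(failed.items()):
    print(queue, n)
```

An open file object for the qmaster messages file can be passed in place of the sample list; where that file lives depends on the installation.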

Also, the CPU usage of sge_commd goes up to nearly 100%.  The CPU
usage of sge_schedd rises as well, but I suspect that
it is because of sge_commd.

During the high CPU usage of sge_commd, the log continually reports
messages like:

Thu May 13 17:27:55 2004|commd|g0255|W|select error: ignoring commproc using fd 7 because the fd is ready to receive AND ready to send an EOF
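
To line these warnings up against the CPU spikes, one option is to count them per minute. This is a rough sketch, assuming the same pipe-delimited layout as the line above and English day and month names in the timestamp; the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Message prefix taken from the commd warning quoted above.
SELECT_ERROR = "select error: ignoring commproc"


def select_errors_per_minute(lines):
    """Count commd select-error warnings, bucketed by minute."""
    per_minute = Counter()
    for line in lines:
        fields = line.strip().split("|", 4)
        if len(fields) != 5:
            continue  # not in the assumed five-field layout
        timestamp, daemon, _host, _level, message = fields
        if daemon != "commd" or not message.startswith(SELECT_ERROR):
            continue
        # Assumes the default C locale for the %a and %b names.
        when = datetime.strptime(timestamp, "%a %b %d %H:%M:%S %Y")
        per_minute[when.replace(second=0)] += 1
    return per_minute


# Lines quoted in this message: one commd warning, one qmaster warning.
sample_lines = [
    "Thu May 13 17:27:55 2004|commd|g0255|W|select error: ignoring commproc "
    "using fd 7 because the fd is ready to receive AND ready to send an EOF",
    'Thu May 13 17:34:40 2004|qmaster|g0255|W|failed to deliver job 2178875.1 to queue "g0163.q"',
]

buckets = select_errors_per_minute(sample_lines)
for minute, n in sorted(buckets.items()):
    print(minute.strftime("%Y-%m-%d %H:%M"), n)
```

The per-minute counts could then be compared by hand with whatever load figures are available for the NFS server over the same period.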

We think the problem might be due to load on the NFS server
that SGE runs from.  However, the load isn't that high, so
we are not positive at this point.

Anyone seen anything like this?

Thanks,
Craig


