[GE users] Qmaster stops starting jobs, nodes go "au"

Ron Chen ron_chen_123 at yahoo.com
Fri May 14 13:11:57 BST 2004


Can you attach a debugger and find out why the commd
is looping? I think you can also dump the commd debug
log to a file.

Also, is it just the qmaster's commd that has the problem?
And does restarting the commd help?
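
To see whether the failures cluster on particular nodes, it may help
to tally the "failed to deliver job" warnings per queue. Below is a
minimal Python sketch. It assumes the messages file uses the
pipe-separated layout seen in the lines you quoted
(timestamp|daemon|host|level|text); the function names are made up
for illustration and are not part of SGE.

```python
import re
from collections import Counter

# Assumed message text, copied from the warnings quoted in this thread.
FAILED = re.compile(r'failed to deliver job (\S+)\s+to queue "([^"]+)"')

FIELDS = ("time", "daemon", "host", "level", "text")


def parse_line(line):
    """Split one messages line into its five fields, or return None.

    Assumes the layout timestamp|daemon|host|level|text, inferred from
    the quoted log lines rather than from SGE documentation.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return dict(zip(FIELDS, parts))


def failed_deliveries(lines):
    """Count 'failed to deliver job' warnings per queue name."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip lines that do not match the assumed layout
        match = FAILED.search(rec["text"])
        if match:
            counts[match.group(2)] += 1
    return counts
```

Calling failed_deliveries() on the open qmaster messages file and
looking at counts.most_common(20) should show whether the same queues
keep failing, and whether they match the nodes in the "u"/"au" state.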

 -Ron

--- Craig Tierney <ctierney at hpti.com> wrote:
> We are having a problem with our SGE server.  For the last
> several days, the server gets in a state where it is unable
> to start jobs and many of the nodes (20-40% of 800 nodes)
> go into either "u" or "au" state.  Jobs get stuck in the "t"
> state as well.
> 
> We are running SGE 5.3p1.
> 
> Messages on the qmaster show that jobs are not being
> delivered to the nodes:
> 
> Thu May 13 17:34:40 2004|qmaster|g0255|W|failed to deliver job 2178875.1 to queue "g0163.q"
> Thu May 13 17:34:41 2004|qmaster|g0255|W|failed to deliver job 2178910.1 to queue "g0422.q"
> Thu May 13 17:34:41 2004|qmaster|g0255|W|failed to deliver job 2178903.1 to queue "g0379.q"
> 
> Also the cpu usage of sge_commd goes up to near 100%.  The cpu
> usage of sge_schedd goes up higher as well, but I suspect that
> it is because of sge_commd.
> 
> During the high cpu usage of sge_commd, the error continually reports
> messages like:
> 
> Thu May 13 17:27:55 2004|commd|g0255|W|select error: ignoring commproc using fd 7 because the fd is ready to receive AND ready to send an EOF
> 
> We think the problem might be due to load on the NFS server
> that SGE runs from.  However, the load isn't that high, so
> we are not positive at this point.
> 
> Anyone seen anything like this?
> 
> Thanks,
> Craig



	
		
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



