[GE users] unavailable nodes and loadleveling

Craig Tierney ctierney at HPTI.com
Thu Apr 28 19:40:20 BST 2005

On Thu, 2005-04-28 at 12:31, Jiann-Ming Su wrote:
> I've been thrown into the SGE fire.  I'm responsible for maintaining
> an already running SGE cluster.  One of the problems I'm seeing is
> that jobs are not being dispersed to all nodes.  After doing a little
> bit of searching, it seems like not all of my nodes are available,
> even though they are physically up.  I run "qstat -j" and get the
> following for all the nodes that don't seem to be available.
>   queue instance "all.q@node16.mydomain.bogus" dropped because it is
> temporarily not available

This could mean many things.  Start with "qstat -f", which lists
every queue instance.  In particular, look for entries in the last
column (states).  Do you see an 'E', 'd', or 'au'?  This will help
you see what is going on.
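
As a rough sketch of that first check, the filter below prints only
the queue instances that have something in the states column.  It
assumes the usual "qstat -f" layout, where a queue line starts with
queue@host and the state, if any, is a sixth whitespace-separated
field.  The function name flagged_queues is made up for this example,
and the column layout may differ between SGE versions.

```shell
# Hypothetical helper: read `qstat -f` output on stdin and print
# "<queue instance> <state>" for every queue line with a state set.
# Assumes queue lines have 5 fields when healthy and 6 when flagged.
flagged_queues() {
    awk '$1 ~ /@/ && NF >= 6 { print $1, $NF }'
}

# Typical use on a host with the SGE binaries in PATH:
#   qstat -f | flagged_queues
```

Job lines listed under a queue start with a job id rather than
queue@host, so the filter skips them.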

For each queue that has a problem, you can look in the messages
file to see what is going on.  The messages can be found on
the server in $SGE_ROOT/default/spool/qmaster/messages.  Grep
for each queue/host and see what it may say.  If the queue is
in E, it will give you a reason.
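
Something like the following can save some typing when checking
several hosts.  It is only a sketch: the function name host_messages
is invented here, and the default path is the one given above, which
assumes the cell is named "default".  Pass a different file as the
second argument if your spool directory lives elsewhere.

```shell
# Hypothetical helper: show qmaster log lines that mention a host or
# queue name.  $1 = fixed string to search for, $2 = optional path to
# the messages file (defaults to the location for the "default" cell).
host_messages() {
    grep -F -- "$1" "${2:-$SGE_ROOT/default/spool/qmaster/messages}"
}

# Typical use on the qmaster host:
#   host_messages node16.mydomain.bogus
```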

If the queues are in 'd' or 'au', it means something different.
If in 'd' (disabled), you can enable the queue with
'qmod -e <queuename>'.  For 'au' (alarm, unknown), the qmaster is
not getting load reports from that host.  Verify that sge_execd is
running on that host.  If it is, try restarting it and watch what
happens.  If nothing is obvious, source dl.sh (or dl.csh, depending
on your shell) from $SGE_ROOT/util and start the execd in debugging
mode.
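
To tie the three cases together, here is a small sketch that turns a
queue name and its state string into the next step described above.
The function name suggest_action is made up for illustration, and it
only prints advice; it does not run qmod or touch the cluster.  States
can be combined (for example 'auE'), so each letter is checked on its
own.

```shell
# Hypothetical helper: print the suggested next step for a queue
# instance given its state string from the last qstat column.
# $1 = queue instance name, $2 = state string (e.g. E, d, au, auE).
suggest_action() {
    queue=$1
    state=$2
    case $state in
        *E*) echo "$queue: error state; grep the qmaster messages file for the reason" ;;
    esac
    case $state in
        *d*) echo "$queue: disabled; enable with: qmod -e $queue" ;;
    esac
    case $state in
        *au*) echo "$queue: alarm/unknown; check that sge_execd is running on the host" ;;
    esac
}

# Example:
#   suggest_action all.q@node16.mydomain.bogus au
```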

This should get you on your way.


> How do I verify a node's participation in the grid?  And, where are
> the config files located?  Qmon seems to be the preferred config tool,
> but I'm more comfortable editing text files.  Thanks for any tips.
